Abstract

Various optimality properties of universal sequence predictors based on
Bayes-mixtures in general, and Solomonoff's prediction scheme in particular, will
be studied. The probability of observing $x_t$ at time $t$, given past
observations $x_1...x_{t-1}$, can be computed with the chain rule if the true
generating distribution $\mu$ of the sequences $x_1x_2x_3...$ is known. If $\mu$
is unknown, but known to belong to a countable or continuous class
$\mathcal{M}$, one can base one's prediction on the Bayes-mixture $\xi$, defined
as a weighted sum or integral of distributions $\nu\in\mathcal{M}$. The
cumulative expected loss of the Bayes-optimal universal prediction scheme based
on $\xi$ is shown to be close to the loss of the Bayes-optimal, but infeasible
prediction scheme based on $\mu$. We show that the bounds are tight and that no
other predictor can lead to significantly smaller bounds. Furthermore, for
various performance measures, we show Pareto-optimality of $\xi$ and give an
Occam's razor argument that the choice $w_\nu \sim 2^{-K(\nu)}$ for the weights
is optimal, where $K(\nu)$ is the length of the shortest program describing
$\nu$. The results are applied to games of chance, defined as a sequence of
bets, observations, and rewards. The prediction schemes (and bounds) are
compared to the popular predictors based on expert advice. Extensions to
infinite alphabets, partial, delayed and probabilistic prediction,
classification, and more active systems are briefly discussed.

Keywords 

Bayesian sequence prediction; mixture distributions; Solomonoff induction; 
Kolmogorov complexity; learning; universal probability; tight loss and error 
bounds; Pareto-optimality; games of chance; classification.

1 Introduction

Many problems are of the induction type in which statements about the future have
to be made, based on past observations. What is the probability of rain tomorrow,
given the weather observations of the last few days? Is the Dow Jones likely to
rise tomorrow, given the chart of the last years and possibly additional newspaper
information? Can we reasonably doubt that the sun will rise tomorrow? Indeed, one
definition of science is to predict the future, where, as an intermediate step, one
tries to understand the past by developing theories and finally to use the
prediction as the basis for some decision. Most induction problems can be studied
in the Bayesian framework. The probability of observing $x_t$ at time $t$, given
the observations $x_1...x_{t-1}$, can be computed with the chain rule, if we know
the true probability distribution, which generates the observed sequence
$x_1x_2x_3...$. The problem is that in many cases we do not even have a reasonable
guess of the true distribution $\mu$. What is the true probability of weather
sequences, stock charts, or sunrises?

In order to overcome the problem of the unknown true distribution, one can
define a mixture distribution $\xi$ as a weighted sum or integral over
distributions $\nu\in\mathcal{M}$, where $\mathcal{M}$ is any discrete or
continuous (hypothesis) set including $\mu$. $\mathcal{M}$ is assumed to be known
and to contain the true distribution, i.e. $\mu\in\mathcal{M}$. Since the
probability $\xi$ can be shown to converge rapidly to the true probability $\mu$
in a conditional sense, making decisions based on $\xi$ is often nearly as good
as the infeasible optimal decision based on the unknown $\mu$ [MF98]. Solomonoff
[Sol64] had the idea to define a universal mixture as a weighted average over
deterministic programs. Lower weights were assigned to longer programs. He
unified Epicurus' principle of multiple explanations and Occam's razor
(simplicity) principle into one formal theory (see [LV97] for this interpretation
of [Sol64]). Inspired by Solomonoff's idea, Levin [ZL70] defined the closely
related universal prior as a weighted average over all semi-computable
probability distributions. If the environment possesses some effective structure
at all, Solomonoff-Levin's posterior "finds" this structure [Sol78], and allows
for a good prediction. In a sense, this solves the induction problem in a
universal way, i.e. without making problem specific assumptions.

Section 2 explains notation and defines the universal or mixture distribution
$\xi$ as the $w_\nu$-weighted sum of probability distributions $\nu$ of a set
$\mathcal{M}$, which includes the true distribution $\mu$. No structural
assumptions are made on the $\nu$. $\xi$ multiplicatively dominates all
$\nu\in\mathcal{M}$, and the relative entropy between $\mu$ and $\xi$ is bounded
by $\ln w_\mu^{-1}$. Convergence of $\xi$ to $\mu$ in a mean squared sense is
shown in Theorem 1. The representation of the universal posterior distribution
and the case $\mu\notin\mathcal{M}$ are briefly discussed. Various standard sets
$\mathcal{M}$ of probability measures are discussed, including computable,
enumerable, cumulatively enumerable, approximable, finite-state, and Markov
(semi)measures.

Section 3 is essentially a generalization of the deterministic error bounds found
in [Hut01b] from the binary alphabet to a general finite alphabet $\mathcal{X}$.
Theorem 2 bounds $E^{\Theta_\xi}-E^{\Theta_\mu}$ by $O(\sqrt{E^{\Theta_\mu}})$,
where $E^{\Theta_\xi}$ is the expected number of errors made by the optimal
universal predictor $\Theta_\xi$, and $E^{\Theta_\mu}$ is the expected number of
errors made by the optimal informed prediction scheme $\Theta_\mu$. The
non-binary setting cannot be reduced to the binary case! One might think of a
binary coding of the symbols $x_t\in\mathcal{X}$ in the sequence $x_1x_2...$. But
this makes it necessary to predict a block of bits $x_t$ before one receives the
true block of bits $x_t$, which differs from the bit by bit prediction scheme
considered in [Sol78, Hut01b]. The framework generalizes to the case where an
action $y_t\in\mathcal{Y}$ results in a loss $\ell_{x_t y_t}$ if $x_t$ is the
next symbol of the sequence. Optimal universal $\Lambda_\xi$ and optimal informed
$\Lambda_\mu$ prediction schemes are defined for this case, and loss bounds
similar to the error bounds are stated. No assumptions on $\ell$ have to be made,
besides boundedness.

Section 4 applies the loss bounds to games of chance, defined as a sequence of
bets, observations, and rewards. The average profit $\bar p_n^{\Lambda_\xi}$
achieved by the $\Lambda_\xi$ scheme rapidly converges to the best possible
average profit $\bar p_n^{\Lambda_\mu}$ achieved by the $\Lambda_\mu$ scheme
($\bar p_n^{\Lambda_\xi}-\bar p_n^{\Lambda_\mu} = O(n^{-1/2})$). If there is a
profitable scheme at all ($\bar p_n^{\Lambda_\mu}>\varepsilon>0$), asymptotically
the universal $\Lambda_\xi$ scheme will also become profitable. A theorem in that
section bounds the time needed to reach the winning zone. It is proportional to
the relative entropy of $\mu$ and $\xi$ with a factor depending on the profit
range and on $\bar p_n^{\Lambda_\mu}$. An attempt is made to give an information
theoretic interpretation of the result.

Section 5 discusses the quality of the universal predictor and the bounds. We
show that there are $\mathcal{M}$ and $\mu\in\mathcal{M}$ and weights $w_\nu$
such that the derived error bounds are tight. This shows that the error bounds
cannot be improved in general. We also show Pareto-optimality of $\xi$ in the
sense that there is no other predictor which performs at least as well in all
environments $\nu\in\mathcal{M}$ and strictly better in at least one. Optimal
predictors can always be based on mixture distributions $\xi$. This still leaves
open how to choose the weights. We give an Occam's razor argument that the choice
$w_\nu = 2^{-K(\nu)}$, where $K(\nu)$ is the length of the shortest program
describing $\nu$, is optimal.

Section 6 generalizes the setup to continuous probability classes
$\mathcal{M} = \{\mu_\theta\}$ consisting of continuously parameterized
distributions $\mu_\theta$ with parameter $\theta\in\mathbb{R}^d$. Under certain
smoothness and regularity conditions a bound for the relative entropy between
$\mu$ and $\xi$, which is central for all presented results, can still be
derived. The bound depends on the Fisher information of $\mu$ and grows only
logarithmically with $n$, the intuitive reason being the necessity to describe
$\theta$ to an accuracy $O(n^{-1/2})$. Furthermore, two ways of using the
prediction schemes for partial sequence prediction, where not every symbol needs
to be predicted, are described. Performing and predicting a sequence of
independent experiments and online learning of classification tasks are special
cases. We also compare the universal prediction scheme studied here to the
popular predictors based on expert advice (PEA)
[LW89, Vov92, LW94, ..97, HKW98, KW99]. Although the algorithms, the settings,
and the proofs are quite different, the PEA bounds and our error bound have the
same structure. Finally, we outline possible extensions of the presented theory
and results, including infinite alphabets, delayed and probabilistic prediction,
active systems influencing the environment, learning aspects, and a unification
with PEA.

Section 7 summarizes the results.

There are good introductions and surveys of Solomonoff sequence prediction
[LV92, LV97], inductive inference in general [AS83, Sol97, MF98], reasoning under
uncertainty [Grü98], and competitive online statistics [Vov99], with interesting
relations to this work. See Section 6 for some more details.



2 Setup and Convergence 

In this section we show that the mixture $\xi$ converges rapidly to the true
distribution $\mu$. After defining basic notation in Section 2.1, we introduce in
Section 2.2 the universal or mixture distribution $\xi$ as the $w_\nu$-weighted
sum of probability distributions $\nu$ of a set $\mathcal{M}$, which includes the
true distribution $\mu$. No structural assumptions are made on the $\nu$. $\xi$
multiplicatively dominates all $\nu\in\mathcal{M}$. A posterior representation of
$\xi$ with incremental weight update is presented in Section 2.3. In Section 2.4
we show that the relative entropy between $\mu$ and $\xi$ is bounded by
$\ln w_\mu^{-1}$ and that $\xi$ converges to $\mu$ in a mean squared sense. The
case $\mu\notin\mathcal{M}$ is briefly discussed in Section 2.5. The section
concludes with Section 2.6, which discusses various standard sets $\mathcal{M}$
of probability measures, including computable, enumerable, cumulatively
enumerable, approximable, finite-state, and Markov (semi)measures.

2.1 Random Sequences 

We denote strings over a finite alphabet $\mathcal{X}$ by $x_1x_2...x_n$ with
$x_t\in\mathcal{X}$ and $t,n,N\in\mathbb{N}$ and $N=|\mathcal{X}|$. We further
use the abbreviations $\epsilon$ for the empty string,
$x_{t:n} := x_t x_{t+1}...x_{n-1}x_n$ for $t\le n$ and $\epsilon$ for $t>n$, and
$x_{<t} := x_1...x_{t-1}$. We use Greek letters for probability distributions (or
measures). Let $\rho(x_1...x_n)$ be the probability that an (infinite) sequence
starts with $x_1...x_n$.

We also need conditional probabilities derived from the chain rule:

    $\rho(x_t|x_{<t}) = \rho(x_{1:t})/\rho(x_{<t})$,

    $\rho(x_1...x_n) = \rho(x_1)\cdot\rho(x_2|x_1)\cdot...\cdot\rho(x_n|x_1...x_{n-1})$.

The first equation states that the probability that a string $x_1...x_{t-1}$ is
followed by $x_t$ is equal to the probability that a string starts with
$x_1...x_t$ divided by the probability that a string starts with $x_1...x_{t-1}$.
For convenience we define $\rho(x_t|x_{<t}) = 0$ if $\rho(x_{<t}) = 0$. The
second equation is the first, applied $n$ times. Whereas $\rho$ might be any
probability distribution, $\mu$ denotes the true (unknown) generating
distribution of the sequences. We denote probabilities by $P$, expectations by
$E$ and further abbreviate

    $E_t[..] := \sum_{x_t\in\mathcal{X}} \mu(x_t|x_{<t})[..]$,
    $E_{1:n}[..] := \sum_{x_{1:n}\in\mathcal{X}^n} \mu(x_{1:n})[..]$,
    $E_{<t}[..] := \sum_{x_{<t}\in\mathcal{X}^{t-1}} \mu(x_{<t})[..]$.

Probabilities $P$ and expectations $E$ are always w.r.t. the true distribution
$\mu$. $E_{1:n} = E_{<n}E_n$ by the chain rule and $E[...] = E_{<t}[...]$ if the
argument is independent of $x_{t:\infty}$, and so on. We abbreviate "with
$\mu$-probability 1" by w.$\mu$.p.1. We say that $z_t$ converges to $z_*$ in mean
sum (i.m.s.) if $c := \sum_{t=1}^\infty E[(z_t - z_*)^2] < \infty$. One can show
that convergence in mean sum implies convergence with probability 1.^1
Convergence i.m.s. is very strong: it provides a "rate" of convergence in the
sense that the expected number of times $t$ in which $z_t$ deviates more than
$\varepsilon$ from $z_*$ is finite and bounded by $c/\varepsilon^2$, and the
probability that the number of $\varepsilon$-deviations exceeds
$c/(\varepsilon^2\delta)$ is smaller than $\delta$.

2.2 Universal Prior Probability Distribution 

Every inductive inference problem can be brought into the following form: Given a
string $x_{<t}$, take a guess at its continuation $x_t$. We will assume that the
strings which have to be continued are drawn from a probability^2 distribution
$\mu$. The maximal prior information a prediction algorithm can possess is the
exact knowledge of $\mu$, but in many cases (like for the probability of sun
tomorrow) the true generating distribution is not known. Instead, the prediction
is based on a guess $\rho$ of $\mu$. We expect that a predictor based on $\rho$
performs well, if $\rho$ is close to $\mu$ or converges, in a sense, to $\mu$.
Let $\mathcal{M} := \{\nu_1,\nu_2,...\}$ be a countable set of candidate
probability distributions on strings. Results are generalized to continuous sets
$\mathcal{M}$ in Section 6.1. We define a weighted average on $\mathcal{M}$

    $\xi(x_{1:n}) := \sum_{\nu\in\mathcal{M}} w_\nu\cdot\nu(x_{1:n}), \quad \sum_{\nu\in\mathcal{M}} w_\nu = 1, \quad w_\nu > 0.$    (1)

It is easy to see that $\xi$ is a probability distribution as the weights $w_\nu$
are positive and normalized to 1 and the $\nu\in\mathcal{M}$ are probabilities.^3
For a finite $\mathcal{M}$ a possible choice for the $w$ is to give all $\nu$
equal weight ($w_\nu = 1/|\mathcal{M}|$). We call $\xi$ universal relative to
$\mathcal{M}$, as it multiplicatively dominates all distributions in
$\mathcal{M}$

    $\xi(x_{1:n}) \ge w_\nu\cdot\nu(x_{1:n})$ for all $\nu\in\mathcal{M}$.    (2)
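
As a small numerical illustration of (1) and (2), the following sketch builds a mixture over a finite class and checks dominance and normalization by brute force. The class of five Bernoulli environments, the uniform weights, and all function names are assumptions chosen for this example only; they are not part of the paper.

```python
from itertools import product

# Hypothetical finite class M: i.i.d. Bernoulli(theta) environments over X = {0, 1}.
THETAS = [0.1, 0.3, 0.5, 0.7, 0.9]
WEIGHTS = [1 / len(THETAS)] * len(THETAS)  # uniform prior, w_nu = 1/|M|


def nu(theta, x):
    """Probability that an i.i.d. Bernoulli(theta) sequence starts with x."""
    p = 1.0
    for symbol in x:
        p *= theta if symbol == 1 else 1.0 - theta
    return p


def xi(x):
    """Mixture xi(x_{1:n}) = sum over nu of w_nu * nu(x_{1:n}), as in (1)."""
    return sum(w * nu(theta, x) for w, theta in zip(WEIGHTS, THETAS))


def dominance_holds(n):
    """Check xi(x) >= w_nu * nu(x) for every string x of length n, as in (2)."""
    return all(
        xi(x) >= w * nu(theta, x)
        for x in product((0, 1), repeat=n)
        for w, theta in zip(WEIGHTS, THETAS)
    )


def total_mass(n):
    """Sum of xi over all strings of length n (1 for a proper measure)."""
    return sum(xi(x) for x in product((0, 1), repeat=n))
```

Dominance is immediate here because every term in the sum defining `xi` is non-negative, which is exactly the argument behind (2).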

In the following, we assume that $\mathcal{M}$ is known and contains the true
distribution, i.e. $\mu\in\mathcal{M}$. If $\mathcal{M}$ is chosen sufficiently
large, then $\mu\in\mathcal{M}$ is not a serious constraint.

^1 Convergence in the mean, i.e. $E[(z_t - z_*)^2] \to 0$, only implies
convergence in probability, which is weaker than convergence with probability 1.

^2 This includes deterministic environments, in which case the probability
distribution $\mu$ is 1 for some sequence $x_{1:\infty}$ and 0 for all others. We
call probability distributions of this kind deterministic.

^3 The weight $w_\nu$ may be interpreted as the initial degree of belief in $\nu$
and $\xi(x_1...x_n)$ as the degree of belief in $x_1...x_n$. If the existence of
true randomness is rejected on philosophical grounds one may consider
$\mathcal{M}$ containing only deterministic environments. $\xi$ still represents
belief probabilities.

2.3 Universal Posterior Probability Distribution

All prediction schemes in this work are based on the conditional probabilities
$\rho(x_t|x_{<t})$. It is possible to express also the conditional probability
$\xi(x_t|x_{<t})$ as a weighted average over the conditional $\nu(x_t|x_{<t})$,
but now with time dependent weights:

    $\xi(x_t|x_{<t}) = \sum_{\nu\in\mathcal{M}} w_\nu(x_{<t})\,\nu(x_t|x_{<t}), \quad w_\nu(x_{1:t}) := w_\nu(x_{<t})\frac{\nu(x_t|x_{<t})}{\xi(x_t|x_{<t})}, \quad w_\nu(\epsilon) := w_\nu.$    (3)

The denominator just ensures correct normalization $\sum_\nu w_\nu(x_{1:t}) = 1$.
By induction and the chain rule we see that
$w_\nu(x_{<t}) = w_\nu\,\nu(x_{<t})/\xi(x_{<t})$. Inserting this into
$\sum_\nu w_\nu(x_{<t})\nu(x_t|x_{<t})$ using (1) gives $\xi(x_t|x_{<t})$, which
proves the equivalence of (1) and (3). The expressions (3) can be used to give an
intuitive, but non-rigorous, argument why $\xi(x_t|x_{<t})$ converges to
$\mu(x_t|x_{<t})$: The weight $w_\nu$ of $\nu$ in $\xi$ increases/decreases if
$\nu$ assigns a high/low probability to the new symbol $x_t$, given $x_{<t}$. For
a $\mu$-random sequence $x_{1:t}$, $\mu(x_{1:t}) \gg \nu(x_{1:t})$ if $\nu$
(significantly) differs from $\mu$. We expect the total weight for all $\nu$
consistent with $\mu$ to converge to 1, and all other weights to converge to 0
for $t\to\infty$. Therefore we expect $\xi(x_t|x_{<t})$ to converge to
$\mu(x_t|x_{<t})$ for $\mu$-random strings $x_{1:\infty}$.

Expressions (3) seem to be more suitable than (1) for studying convergence and
loss bounds of the universal predictor $\xi$, but it will turn out that (2) is
all we need, with the sole exception of one later proof. Probably (3) is useful
when one tries to understand the learning aspect in $\xi$.
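
The incremental update (3) can be sketched in a few lines. The example below assumes a hypothetical class of three Bernoulli environments with hand-picked prior weights, and compares the step-by-step posterior weights with the closed form $w_\nu\,\nu(x_{1:t})/\xi(x_{1:t})$ obtained by induction. All names and numbers are illustrative assumptions.

```python
# Hypothetical class M of Bernoulli(theta) environments and prior weights w_nu.
THETAS = [0.2, 0.5, 0.8]
PRIOR = [0.25, 0.5, 0.25]


def cond(theta, symbol):
    """nu(x_t | x_<t) for an i.i.d. Bernoulli(theta) environment."""
    return theta if symbol == 1 else 1.0 - theta


def incremental(x):
    """Apply update (3) symbol by symbol.

    Returns the posterior weights w_nu(x_{1:t}) and the product of the
    conditionals xi(x_t | x_<t), which by the chain rule equals xi(x_{1:t}).
    """
    weights = list(PRIOR)
    xi_joint = 1.0
    for symbol in x:
        probs = [cond(theta, symbol) for theta in THETAS]
        xi_cond = sum(w * p for w, p in zip(weights, probs))
        weights = [w * p / xi_cond for w, p in zip(weights, probs)]
        xi_joint *= xi_cond
    return weights, xi_joint


def closed_form(x):
    """Posterior weights w_nu * nu(x) / xi(x) computed directly from (1)."""
    likelihoods = []
    for theta in THETAS:
        p = 1.0
        for symbol in x:
            p *= cond(theta, symbol)
        likelihoods.append(p)
    xi_joint = sum(w * p for w, p in zip(PRIOR, likelihoods))
    return [w * p / xi_joint for w, p in zip(PRIOR, likelihoods)], xi_joint
```

On a sequence dominated by ones, the weight of the environment with the largest `theta` grows at the expense of the others, which is the intuition described in the text.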



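2.4 Convergence of $\xi$ to $\mu$

We use the relative entropy and the squared Euclidian/absolute distance to
measure the instantaneous and total distances between $\mu$ and $\xi$:

    $d_t(x_{<t}) := E_t \ln\frac{\mu(x_t|x_{<t})}{\xi(x_t|x_{<t})}, \quad D_n := \sum_{t=1}^n E_{<t}\, d_t(x_{<t}) = E_{1:n} \ln\frac{\mu(x_{1:n})}{\xi(x_{1:n})}$    (4)

    $s_t(x_{<t}) := \sum_{x_t} \big(\mu(x_t|x_{<t}) - \xi(x_t|x_{<t})\big)^2, \quad S_n := \sum_{t=1}^n E_{<t}\, s_t(x_{<t})$    (5)

    $a_t(x_{<t}) := \sum_{x_t} \big|\mu(x_t|x_{<t}) - \xi(x_t|x_{<t})\big|, \quad V_n := \frac{1}{2}\sum_{t=1}^n E_{<t}\, a_t^2(x_{<t})$    (6)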



One can show that $s_t \le \frac{1}{2}a_t^2 \le d_t$ [Hut01a, Sec.3.2]
[CT91, Lem.12.6.1], hence $S_n \le V_n \le D_n$ (for binary alphabet,
$s_t = \frac{1}{2}a_t^2$, hence $S_n = V_n$). So bounds in terms of $S_n$ are
tightest, while the (implied) looser bounds in terms of $V_n$, as a referee
pointed out, have the advantage of being reparametrization-invariant in the case
of continuous alphabets (not considered here). The weakening to $D_n$ is used,
since $D_n$ can easily be bounded in terms of the weight $w_\mu$.

Theorem 1 (Convergence) Let there be sequences $x_1x_2...$ over a finite alphabet
$\mathcal{X}$ drawn with probability $\mu(x_{1:n})$ for the first $n$ symbols.
The universal conditional probability $\xi(x_t|x_{<t})$ of the next symbol $x_t$
given $x_{<t}$ is related to the true conditional probability $\mu(x_t|x_{<t})$
in the following way:

    $\sum_{t=1}^n E_{<t} \sum_{x_t} \big(\mu(x_t|x_{<t}) - \xi(x_t|x_{<t})\big)^2 \equiv S_n \le V_n \le D_n \le \ln w_\mu^{-1} =: b_\mu < \infty$

where $d_t$ and $D_n$ are the relative entropies (4), and $w_\mu$ is the weight
(1) of $\mu$ in $\xi$.

A proof for binary alphabet can be found in [Sol78, LV97] and for a general
finite alphabet in [Hut01a]. The finiteness of $S_\infty$ implies
$\xi(x'_t|x_{<t}) - \mu(x'_t|x_{<t}) \to 0$ for $t\to\infty$ i.m.s., and hence
w.$\mu$.p.1 for any $x'_t$. There are other convergence results, most notably
$\xi(x_t|x_{<t})/\mu(x_t|x_{<t}) \to 1$ for $t\to\infty$ w.$\mu$.p.1
[LV97, Hut03a]. These convergence results motivate the belief that predictions
based on (the known) $\xi$ are asymptotically as good as predictions based on
(the unknown) $\mu$ with rapid convergence.
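
The chain $S_n \le V_n \le D_n \le \ln w_\mu^{-1}$ of Theorem 1 can be checked exactly on a toy problem by enumerating all histories. The sketch below assumes a hypothetical class of three Bernoulli environments with uniform weights and takes the last one as the true $\mu$; these choices and the function names are illustrative only.

```python
import math
from itertools import product

THETAS = [0.2, 0.5, 0.8]          # hypothetical class M (Bernoulli parameters)
PRIOR = [1 / 3, 1 / 3, 1 / 3]     # uniform weights w_nu
MU_INDEX = 2                      # assume the true mu is Bernoulli(0.8)


def joint(theta, x):
    """Probability that a Bernoulli(theta) sequence starts with x."""
    p = 1.0
    for symbol in x:
        p *= theta if symbol == 1 else 1.0 - theta
    return p


def xi_joint(x):
    """Mixture probability xi(x) as in (1)."""
    return sum(w * joint(theta, x) for w, theta in zip(PRIOR, THETAS))


def distances(n):
    """Return (S_n, V_n, D_n), each a mu-expectation over all histories."""
    mu_theta = THETAS[MU_INDEX]
    s_n = v_n = d_n = 0.0
    for t in range(1, n + 1):
        for past in product((0, 1), repeat=t - 1):
            mu_past, xi_past = joint(mu_theta, past), xi_joint(past)
            s_t = a_t = d_t = 0.0
            for symbol in (0, 1):
                mu_c = joint(mu_theta, past + (symbol,)) / mu_past
                xi_c = xi_joint(past + (symbol,)) / xi_past
                s_t += (mu_c - xi_c) ** 2
                a_t += abs(mu_c - xi_c)
                d_t += mu_c * math.log(mu_c / xi_c)
            s_n += mu_past * s_t
            v_n += mu_past * 0.5 * a_t ** 2
            d_n += mu_past * d_t
    return s_n, v_n, d_n
```

For this binary alphabet the first two quantities coincide up to rounding, in line with the remark that $s_t = \frac{1}{2}a_t^2$ in the binary case.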



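2.5 The Case where $\mu\notin\mathcal{M}$

In the following we discuss two cases, where $\mu\notin\mathcal{M}$, but most
parts of this work still apply. Actually all theorems remain valid for $\mu$
being a finite linear combination
$\mu(x_{1:n}) = \sum_{\nu\in\mathcal{L}} v_\nu\,\nu(x_{1:n})$ of $\nu$'s in
$\mathcal{L}\subseteq\mathcal{M}$. Dominance
$\xi(x_{1:n}) \ge w_\mu\cdot\mu(x_{1:n})$ is still ensured with
$w_\mu := \min_{\nu\in\mathcal{L}} \frac{w_\nu}{v_\nu} \ge \min_{\nu\in\mathcal{L}} w_\nu$.
More generally, if $\mu$ is an infinite linear combination, dominance is still
ensured if $w_\nu$ itself dominates $v_\nu$ in the sense that
$w_\nu \ge \alpha v_\nu$ for some $\alpha > 0$ (then $w_\mu \ge \alpha$).

Another possibly interesting situation is when the true generating distribution
$\mu\notin\mathcal{M}$, but a "nearby" distribution $\hat\mu$ with weight
$w_{\hat\mu}$ is in $\mathcal{M}$. If we measure the distance of $\hat\mu$ to
$\mu$ with the Kullback-Leibler divergence
$D_n(\mu||\hat\mu) := \sum_{x_{1:n}} \mu(x_{1:n}) \ln\frac{\mu(x_{1:n})}{\hat\mu(x_{1:n})}$
and assume that it is bounded by a constant $c$, then

    $D_n = E_{1:n}\ln\frac{\mu(x_{1:n})}{\xi(x_{1:n})} = E_{1:n}\ln\frac{\hat\mu(x_{1:n})}{\xi(x_{1:n})} + E_{1:n}\ln\frac{\mu(x_{1:n})}{\hat\mu(x_{1:n})} \le \ln w_{\hat\mu}^{-1} + c.$

So $D_n \le \ln w_\mu^{-1}$ remains valid if we define
$w_\mu := w_{\hat\mu}\cdot e^{-c}$.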



2.6 Probability Classes $\mathcal{M}$

In the following we describe some well-known and some less known probability
classes $\mathcal{M}$. This relates our setting to other works in this area,
embeds it into the historical context, illustrates the type of classes we have in
mind, and discusses computational issues.

We get a rather wide class $\mathcal{M}$ if we include all (semi)computable
probability distributions in $\mathcal{M}$. In this case, the assumption
$\mu\in\mathcal{M}$ is very weak, as it only assumes that the strings are drawn
from any (semi)computable distribution; and all valid physical theories (and,
hence, all environments) are computable to arbitrary precision (in a
probabilistic sense).

We will see that it is favorable to assign high weights $w_\nu$ to the $\nu$.
Simplicity should be favored over complexity, according to Occam's razor. In our
context this means that a high weight should be assigned to simple $\nu$. The
prefix Kolmogorov complexity $K(\nu)$ is a universal complexity measure
[Kol65, ZL70, LV97]. It is defined as the length of the shortest self-delimiting
program (on a universal Turing machine) computing $\nu(x_{1:n})$ given $x_{1:n}$.
If we define

    $w_\nu := 2^{-K(\nu)}$

then distributions which can be calculated by short programs have high weights.
The relative entropy is bounded by the Kolmogorov complexity of $\mu$ in this
case ($D_n \le K(\mu)\cdot\ln 2$). Levin's universal semi-measure is obtained if
we take $\mathcal{M} = \mathcal{M}_U$ to be the (multi)set enumerated by a Turing
machine which enumerates all enumerable semi-measures [ZL70, LV97]. Recently,
$\mathcal{M}$ has been further enlarged to include all cumulatively enumerable
semi-measures [Sch02a]. In the enumerable and cumulatively enumerable cases,
$\xi$ is not finitely computable, but can still be approximated to arbitrary but
not pre-specifiable precision. If we consider all approximable (i.e.
asymptotically computable) distributions, then the universal distribution $\xi$,
although still well defined, is not even approximable [Hut03b]. An interesting
and quickly approximable distribution is the Speed prior $S$ defined in [Sch02b].
It is related to Levin complexity and Levin search [Lev73, Lev84], but it is
unclear for now which distributions are dominated by $S$. If one considers only
finite-state automata instead of general Turing machines, $\xi$ is related to the
quickly computable, universal finite-state prediction scheme of Feder et al.
[FMG92], which itself is related to the famous Lempel-Ziv data compression
algorithm. If one has extra knowledge on the source generating the sequence, one
might further reduce $\mathcal{M}$ and increase $w$. A detailed analysis of these
and other specific classes $\mathcal{M}$ will be given elsewhere. Note that
$\xi\in\mathcal{M}$ in the enumerable and cumulatively enumerable case, but
$\xi\notin\mathcal{M}$ in the computable, approximable and finite-state case. If
$\xi$ is itself in $\mathcal{M}$, it is called a universal element of
$\mathcal{M}$ [LV97]. As we do not need this property here, $\mathcal{M}$ may be
any countable set of distributions. In the following sections we consider generic
$\mathcal{M}$ and $w$.

We have discussed various discrete classes $\mathcal{M}$, which are sufficient
from a constructive or computational point of view. On the other hand, it is
convenient to also allow for continuous classes $\mathcal{M}$. For instance, the
class of all Bernoulli processes with parameter $\theta\in[0,1]$ and uniform
prior $w_\theta \equiv 1$ is much easier to deal with than computable $\theta$
only, with prior $w_\theta = 2^{-K(\theta)}$. Other important continuous classes
are the class of i.i.d. and Markov processes. Continuous classes $\mathcal{M}$
are considered in more detail in Section 6.1.

3 Error Bounds

In this section we prove error bounds for predictors based on the mixture $\xi$.
Section 3.1 introduces the concept of Bayes-optimal predictors $\Theta_\rho$,
minimizing $\rho$-expected error. In Section 3.2 we bound
$E^{\Theta_\xi}-E^{\Theta_\mu}$ by $O(\sqrt{E^{\Theta_\mu}})$, where
$E^{\Theta_\xi}$ is the expected number of errors made by the optimal universal
predictor $\Theta_\xi$, and $E^{\Theta_\mu}$ is the expected number of errors
made by the optimal informed prediction scheme $\Theta_\mu$. The proof is
deferred to Section 3.3. In Section 3.4 we generalize the framework to the case
where an action $y_t\in\mathcal{Y}$ results in a loss $\ell_{x_t y_t}$ if $x_t$
is the next symbol of the sequence. Optimal universal $\Lambda_\xi$ and optimal
informed $\Lambda_\mu$ prediction schemes are defined for this case, and loss
bounds similar to the error bounds are presented. No assumptions on $\ell$ have
to be made, besides boundedness.

3.1 Bayes-Optimal Predictors

We start with a very simple measure: making a wrong prediction counts as one
error, making a correct prediction counts as no error. In [Hut01b] error bounds
have been proven for the binary alphabet $\mathcal{X} = \{0,1\}$. The following
generalization to an arbitrary alphabet involves only minor additional
complications, but serves as an introduction to the more complicated model with
arbitrary loss function. Let $\Theta_\mu$ be the optimal prediction scheme when
the strings are drawn from the probability distribution $\mu$, i.e. the
probability of $x_t$ given $x_{<t}$ is $\mu(x_t|x_{<t})$, and $\mu$ is known.
$\Theta_\mu$ predicts (by definition) $x_t^{\Theta_\mu}$ when observing
$x_{<t}$. The prediction is erroneous if the true $t$-th symbol is not
$x_t^{\Theta_\mu}$. The probability of this event is
$1-\mu(x_t^{\Theta_\mu}|x_{<t})$. It is minimized if $x_t^{\Theta_\mu}$ maximizes
$\mu(x_t^{\Theta_\mu}|x_{<t})$. More generally, let $\Theta_\rho$ be a prediction
scheme predicting $x_t^{\Theta_\rho} := \arg\max_{x_t}\rho(x_t|x_{<t})$ for some
distribution $\rho$. Every deterministic predictor can be interpreted as
maximizing some distribution.
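
A minimal sketch of the $\Theta_\rho$ rule: predict the symbol that maximizes $\rho(\cdot|x_{<t})$ and score it with the true conditional $\mu(\cdot|x_{<t})$. The three-letter alphabet and the two conditional distributions are made-up numbers for illustration, and the function names are not from the paper.

```python
def theta_predict(rho_cond):
    """Return the symbol maximizing rho(x_t | x_<t); ties go to the first key."""
    return max(rho_cond, key=rho_cond.get)


def error_probability(mu_cond, predicted):
    """mu-probability that the prediction is wrong: 1 - mu(predicted | x_<t)."""
    return 1.0 - mu_cond[predicted]


# Hypothetical conditionals at one time step over the alphabet {a, b, c}.
MU_COND = {"a": 0.5, "b": 0.3, "c": 0.2}     # true distribution mu
RHO_COND = {"a": 0.2, "b": 0.45, "c": 0.35}  # some other guess rho

error_informed = error_probability(MU_COND, theta_predict(MU_COND))
error_guess = error_probability(MU_COND, theta_predict(RHO_COND))
```

Predicting with $\mu$ itself can never give a larger error probability than predicting with another $\rho$, since it picks the symbol of maximal $\mu$-probability.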

3.2 Total Expected Numbers of Errors 

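The $\mu$-probability of making a wrong prediction for the $t$-th symbol and the
total $\mu$-expected number of errors in the first $n$ predictions of predictor
$\Theta_\rho$ are

    $e_t^{\Theta_\rho}(x_{<t}) := 1-\mu(x_t^{\Theta_\rho}|x_{<t}), \quad E_n^{\Theta_\rho} := \sum_{t=1}^n E_{<t}\, e_t^{\Theta_\rho}(x_{<t}).$    (7)

If $\mu$ is known, $\Theta_\mu$ is obviously the best prediction scheme in the
sense of making the least number of expected errors

    $E_n^{\Theta_\mu} \le E_n^{\Theta_\rho}$ for any $\Theta_\rho$,    (8)

since

    $e_t^{\Theta_\mu}(x_{<t}) = 1-\mu(x_t^{\Theta_\mu}|x_{<t}) = \min_{x_t}\{1-\mu(x_t|x_{<t})\} \le 1-\mu(x_t^{\Theta_\rho}|x_{<t}) = e_t^{\Theta_\rho}(x_{<t})$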






for any $\rho$. Of special interest is the universal predictor $\Theta_\xi$. As
$\xi$ converges to $\mu$ the prediction of $\Theta_\xi$ might converge to the
prediction of the optimal $\Theta_\mu$. Hence, $\Theta_\xi$ may not make many
more errors than $\Theta_\mu$ and, hence, any other predictor $\Theta_\rho$. Note
that $x_t^{\Theta_\rho}$ is a discontinuous function of $\rho$ and
$x_t^{\Theta_\xi} \to x_t^{\Theta_\mu}$ cannot be proven from $\xi\to\mu$.
Indeed, this problem occurs in related prediction schemes, where the predictor
has to be regularized so that it is continuous [FMG92]. Fortunately this is not
necessary here. We prove the following error bound.

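Theorem 2 (Error Bound) Let there be sequences $x_1x_2...$ over a finite alphabet
$\mathcal{X}$ drawn with probability $\mu(x_{1:n})$ for the first $n$ symbols.
The $\Theta_\rho$-system predicts by definition
$x_t^{\Theta_\rho}\in\mathcal{X}$ from $x_{<t}$, where $x_t^{\Theta_\rho}$
maximizes $\rho(x_t|x_{<t})$. $\Theta_\xi$ is the universal prediction scheme
based on the universal prior $\xi$. $\Theta_\mu$ is the optimal informed
prediction scheme. The total $\mu$-expected number of prediction errors
$E_n^{\Theta_\xi}$ and $E_n^{\Theta_\mu}$ of $\Theta_\xi$ and $\Theta_\mu$ as
defined in (7) are bounded in the following way

    $0 \le E_n^{\Theta_\xi}-E_n^{\Theta_\mu} \le \sqrt{2Q_nS_n} \le \sqrt{2(E_n^{\Theta_\xi}+E_n^{\Theta_\mu})S_n} \le S_n+\sqrt{4E_n^{\Theta_\mu}S_n+S_n^2} \le 2S_n+2\sqrt{E_n^{\Theta_\mu}S_n}$

where $Q_n = \sum_{t=1}^n E_{<t}\, q_t$ (with
$q_t(x_{<t}) := 1-\delta_{x_t^{\Theta_\xi} x_t^{\Theta_\mu}}$) is the expected
number of non-optimal predictions made by $\Theta_\xi$ and
$S_n \le V_n \le D_n \le \ln w_\mu^{-1}$, where $S_n$ is the squared Euclidian
distance (5), $V_n$ half of the squared absolute distance (6), $D_n$ the relative
entropy (4), and $w_\mu$ the weight (1) of $\mu$ in $\xi$.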

The first two bounds have a nice structure, but the r.h.s. actually depends on
$\Theta_\xi$, so they are not particularly useful, but these are the major bounds
we will prove, the others follow easily. In Section 5 we show that the third
bound is optimal. The last bound, which we discuss in the following, has the same
asymptotics as the third bound. Note that the bounds hold for any (semi)measure
$\xi$; only $D_n \le \ln w_\mu^{-1}$ depends on $\xi$ dominating $\mu$ with
domination constant $w_\mu$.

First, we observe that Theorem 2 implies that the number of errors
$E_\infty^{\Theta_\xi}$ of the universal $\Theta_\xi$ predictor is finite if the
number of errors $E_\infty^{\Theta_\mu}$ of the informed $\Theta_\mu$ predictor
is finite. In particular, this is the case for deterministic $\mu$, as
$E_n^{\Theta_\mu} \equiv 0$ in this case^4, i.e. $\Theta_\xi$ makes only a finite
number of errors on deterministic environments. This can also be proven by
elementary means. Assume $x_1x_2...$ is the sequence generated by $\mu$ and
$\Theta_\xi$ makes a wrong prediction $x_t^{\Theta_\xi} \ne x_t$. Since
$\xi(x_t^{\Theta_\xi}|x_{<t}) \ge \xi(x_t|x_{<t})$, this implies
$\xi(x_t|x_{<t}) \le \frac{1}{2}$. Hence
$e_t^{\Theta_\xi} = 1 \le -\ln\xi(x_t|x_{<t})/\ln 2 = d_t/\ln 2$. If $\Theta_\xi$
makes a correct prediction, $e_t^{\Theta_\xi} = 0 \le d_t/\ln 2$ is obvious.
Summing over $t$, this proves
$E_\infty^{\Theta_\xi} \le D_\infty/\ln 2 \le \log_2 w_\mu^{-1}$. A combinatoric
argument given in Section 5 shows that there are $\mathcal{M}$ and
$\mu\in\mathcal{M}$ with $E_\infty^{\Theta_\xi} \ge \log_2|\mathcal{M}|$. This
shows that the upper bound $E_\infty^{\Theta_\xi} \le \log_2|\mathcal{M}|$ for
uniform $w$ is sharp. From Theorem 2 we get the slightly weaker bound
$E_\infty^{\Theta_\xi} \le 2S_\infty \le 2D_\infty \le 2\ln w_\mu^{-1}$. For more
complicated probabilistic environments, where even the ideal informed system
makes an infinite number of errors, the theorem ensures that the error regret
$E_n^{\Theta_\xi}-E_n^{\Theta_\mu}$ is only of order $\sqrt{E_n^{\Theta_\mu}}$.
The regret is quantified in terms of the information content $D_n$ of $\mu$
(relative to $\xi$), or the weight $w_\mu$ of $\mu$ in $\xi$. This ensures that
the error densities $E_n/n$ of both systems converge to each other. Actually, the
theorem ensures more, namely that the quotient converges to 1, and also gives the
speed of convergence
$E_n^{\Theta_\xi}/E_n^{\Theta_\mu} = 1+O((E_n^{\Theta_\mu})^{-1/2}) \to 1$ for
$E_n^{\Theta_\mu}\to\infty$. If we increase the first occurrence of
$E_n^{\Theta_\mu}$ in the theorem to $E_n^{\Theta}$ and the second to
$E_n^{\Theta_\xi}$ we get the bound
$E_n^{\Theta} \ge E_n^{\Theta_\xi} - 2S_n - 2\sqrt{E_n^{\Theta_\xi}S_n}$, which
shows that no (causal) predictor $\Theta$ whatsoever makes significantly fewer
errors than $\Theta_\xi$. In Section 5 we show that the third bound for
$E_n^{\Theta_\xi}-E_n^{\Theta_\mu}$ given in Theorem 2 can in general not be
improved, i.e. for every predictor $\Theta$ (particularly $\Theta_\xi$) there
exist $\mathcal{M}$ and $\mu\in\mathcal{M}$ such that the upper bound is
essentially achieved. See [Hut01b] for some further discussion and bounds for
binary alphabet.

^4 Remember that we named a probability distribution deterministic if it is 1 for
exactly one sequence and 0 for all others.
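
The chain of inequalities in Theorem 2 can be verified exactly for short horizons by enumerating all histories. The sketch below assumes a hypothetical class of three Bernoulli environments, non-uniform weights, and a true $\mu$ inside the class; these numbers and all function names are illustrative and not taken from the paper.

```python
import math
from itertools import product

THETAS = [0.2, 0.5, 0.7]   # hypothetical class M (Bernoulli parameters)
PRIOR = [0.5, 0.3, 0.2]    # hypothetical weights w_nu
MU = 0.7                   # true environment, assumed to lie in M


def joint(theta, x):
    """Probability that a Bernoulli(theta) sequence starts with x."""
    p = 1.0
    for symbol in x:
        p *= theta if symbol == 1 else 1.0 - theta
    return p


def xi_joint(x):
    """Mixture probability xi(x) as in (1)."""
    return sum(w * joint(theta, x) for w, theta in zip(PRIOR, THETAS))


def error_statistics(n):
    """Return (E_xi, E_mu, Q_n, S_n) as exact mu-expectations."""
    e_xi = e_mu = q_n = s_n = 0.0
    for t in range(1, n + 1):
        for past in product((0, 1), repeat=t - 1):
            mu_past, xi_past = joint(MU, past), xi_joint(past)
            mu_c = [joint(MU, past + (a,)) / mu_past for a in (0, 1)]
            xi_c = [xi_joint(past + (a,)) / xi_past for a in (0, 1)]
            m = max((0, 1), key=lambda a: mu_c[a])  # prediction of Theta_mu
            s = max((0, 1), key=lambda a: xi_c[a])  # prediction of Theta_xi
            e_mu += mu_past * (1.0 - mu_c[m])
            e_xi += mu_past * (1.0 - mu_c[s])
            q_n += mu_past * (0.0 if m == s else 1.0)
            s_n += mu_past * sum((y - z) ** 2 for y, z in zip(mu_c, xi_c))
    return e_xi, e_mu, q_n, s_n


def bound_chain(n):
    """Terms of the Theorem 2 chain, in the order they appear in the theorem."""
    e_xi, e_mu, q_n, s_n = error_statistics(n)
    return [
        0.0,
        e_xi - e_mu,
        math.sqrt(2 * q_n * s_n),
        math.sqrt(2 * (e_xi + e_mu) * s_n),
        s_n + math.sqrt(4 * e_mu * s_n + s_n ** 2),
        2 * s_n + 2 * math.sqrt(e_mu * s_n),
    ]
```

Each entry of `bound_chain(n)` should be no larger than the next one, up to floating point rounding.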

3.3 Proof of Theorem 2

The first inequality in Theorem 2 has already been proven (8). For the second
inequality, let us start more modestly and try to find constants $A>0$ and $B>0$
that satisfy the linear inequality

    $E_n^{\Theta_\xi} - E_n^{\Theta_\mu} \le AQ_n + BS_n.$    (9)

If we could show

    $e_t^{\Theta_\xi}(x_{<t}) - e_t^{\Theta_\mu}(x_{<t}) \le Aq_t(x_{<t}) + Bs_t(x_{<t})$    (10)

for all $t\le n$ and all $x_{<t}$, (9) would follow immediately by summation and
the definition of $E_n$, $Q_n$ and $S_n$. With the abbreviations

    $\mathcal{X} = \{1,...,N\}, \quad N = |\mathcal{X}|, \quad i = x_t, \quad y_i = \mu(x_t|x_{<t}), \quad z_i = \xi(x_t|x_{<t}),$
    $m = x_t^{\Theta_\mu}, \quad s = x_t^{\Theta_\xi}$

the various error functions can then be expressed by
$e_t^{\Theta_\xi} = 1-y_s$, $e_t^{\Theta_\mu} = 1-y_m$, $q_t = 1-\delta_{ms}$
and $s_t = \sum_i (y_i-z_i)^2$. Inserting this into (10) we get

    $y_m - y_s \le A[1-\delta_{ms}] + B\sum_{i=1}^N (y_i-z_i)^2.$    (11)

By definition of $x_t^{\Theta_\mu}$ and $x_t^{\Theta_\xi}$ we have
$y_m \ge y_i$ and $z_s \ge z_i$ for all $i$. We prove a sequence of inequalities
which show that

    $B\sum_{i=1}^N (y_i-z_i)^2 + A[1-\delta_{ms}] - (y_m-y_s) \ge ...$    (12)

is positive for suitable $A>0$ and $B>0$, which proves (11). For $m=s$ (12) is
obviously positive. So we will assume $m\ne s$ in the following. From the square
we keep only contributions from $i=m$ and $i=s$.

    $... \ge B[(y_m-z_m)^2 + (y_s-z_s)^2] + A - (y_m-y_s) \ge ...$

By definition of $y$, $z$, $m$ and $s$ we have the constraints
$y_m+y_s \le 1$, $z_m+z_s \le 1$, $y_m \ge y_s \ge 0$ and $z_s \ge z_m \ge 0$.
From the latter two it is easy to see that the square terms (as a function of
$z_m$ and $z_s$) are minimized by $z_m = z_s = \frac{1}{2}(y_m+y_s)$. Together
with the abbreviation $x := y_m-y_s$ we get

    $... \ge \frac{1}{2}Bx^2 + A - x \ge ...$    (13)

(13) is quadratic in $x$ and minimized by $x^* = \frac{1}{B}$. Inserting $x^*$
gives

    $... \ge A - \frac{1}{2B} \ge 0$ for $2AB \ge 1$.

Inequality (9) therefore holds for any $A>0$, provided we insert
$B = \frac{1}{2A}$. Thus we might minimize the r.h.s. of (9) w.r.t. $A$, leading
to the upper bound

    $E_n^{\Theta_\xi} - E_n^{\Theta_\mu} \le \sqrt{2Q_nS_n}$ for $A^2 = \frac{S_n}{2Q_n}$

which is the first bound in Theorem 2. For the second bound we have to show
$Q_n \le E_n^{\Theta_\xi}+E_n^{\Theta_\mu}$, which follows by summation from
$q_t \le e_t^{\Theta_\xi}+e_t^{\Theta_\mu}$, which is equivalent to
$1-\delta_{ms} \le 1-y_s+1-y_m$, which holds for $m=s$ as well as $m\ne s$. For
the third bound we have to prove

    $\sqrt{2(E_n^{\Theta_\xi}+E_n^{\Theta_\mu})S_n} - S_n \le \sqrt{4E_n^{\Theta_\mu}S_n + S_n^2}.$    (14)

If we square both sides of this expression and simplify we just get the second
bound. Hence, the second bound implies (14). The last inequality in Theorem 2 is
a simple triangle inequality. This completes the proof of Theorem 2. □

Note that also the third bound implies the second one:

    $E_n^{\Theta_\xi}-E_n^{\Theta_\mu} \le \sqrt{2(E_n^{\Theta_\xi}+E_n^{\Theta_\mu})S_n} \;\Leftrightarrow\; (E_n^{\Theta_\xi}-E_n^{\Theta_\mu})^2 \le 2(E_n^{\Theta_\xi}+E_n^{\Theta_\mu})S_n \;\Leftrightarrow$

    $\Leftrightarrow\; (E_n^{\Theta_\xi}-E_n^{\Theta_\mu}-S_n)^2 \le 4E_n^{\Theta_\mu}S_n + S_n^2 \;\Leftrightarrow\; E_n^{\Theta_\xi}-E_n^{\Theta_\mu}-S_n \le \sqrt{4E_n^{\Theta_\mu}S_n + S_n^2}$

where we only have used $E_n^{\Theta_\xi} \ge E_n^{\Theta_\mu}$. Nevertheless the
bounds are not equal.
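
The per-step inequality (11) with $B = \frac{1}{2A}$ can be probed numerically. The sketch below draws random pairs of distributions $y$ and $z$ and reports the smallest value of right-hand side minus left-hand side. The sampling scheme, the chosen values of $A$, and the function names are assumptions made for this illustration; a random search of this kind supports the proof but does not replace it.

```python
import random


def random_distribution(rng, size):
    """Draw a probability vector of the given size (strictly positive entries)."""
    raw = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(raw)
    return [r / total for r in raw]


def slack(y, z, a_const):
    """Right-hand side minus left-hand side of (11), using B = 1/(2A)."""
    b_const = 1.0 / (2.0 * a_const)
    m = max(range(len(y)), key=lambda i: y[i])  # index predicted by Theta_mu
    s = max(range(len(z)), key=lambda i: z[i])  # index predicted by Theta_xi
    delta = 1.0 if m == s else 0.0
    square = sum((yi - zi) ** 2 for yi, zi in zip(y, z))
    return a_const * (1.0 - delta) + b_const * square - (y[m] - y[s])


def min_slack(trials=2000, size=4, seed=0):
    """Smallest slack over random (y, z) pairs and a few values of A."""
    rng = random.Random(seed)
    return min(
        slack(random_distribution(rng, size), random_distribution(rng, size), a)
        for _ in range(trials)
        for a in (0.25, 0.5, 1.0, 2.0, 4.0)
    )
```

The inequality is essentially tight when $z_m$ and $z_s$ both sit near $\frac{1}{2}(y_m+y_s)$ and $A = \frac{1}{2}(y_m-y_s)$, which mirrors the minimization steps in the proof.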

3.4 General Loss Function 

A prediction is very often the basis for some decision. The decision results in
an action, which itself leads to some reward or loss. If the action itself can
influence the environment we enter the domain of acting agents which has been
analyzed in the context of universal probability in [Hut01c]. To stay in the
framework of (passive) prediction we have to assume that the action itself does
not influence the environment. Let $\ell_{x_t y_t}$ be the received loss when
taking action $y_t\in\mathcal{Y}$ and $x_t\in\mathcal{X}$ is the $t$-th symbol of
the sequence. We make the assumption that $\ell$ is bounded. Without loss of
generality we normalize $\ell$ by linear scaling such that
$0 \le \ell_{x_t y_t} \le 1$.

For instance, if we make a sequence of weather forecasts
$\mathcal{X}$ = {sunny, rainy} and base our decision, whether to take an umbrella
or wear sunglasses, $\mathcal{Y}$ = {umbrella, sunglasses}, on it, the action of
taking the umbrella or wearing sunglasses does not influence the future weather
(ignoring the butterfly effect). The losses might be


Loss 


sunny 


rainy 


umbrella 


0.1 


0.3 


sunglasses 


0.0 


1.0 



Note the loss assignment even when making the right decision to take an umbrella 
when it rains because sun is still preferable to rain. 

In many cases the prediction of $x_t$ can be identified with or is already the action $y_t$. The forecast sunny can be identified with the action wear sunglasses, and rainy with take umbrella; $\mathcal{X}\equiv\mathcal{Y}$ in these cases. The error assignment of the previous subsections falls into this class together with a special loss function. It assigns unit loss to an erroneous prediction ($\ell_{x_t y_t}=1$ for $x_t\neq y_t$) and no loss to a correct prediction ($\ell_{x_t x_t}=0$).

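For convenience we name an action a prediction in the following, even if $\mathcal{X}\neq\mathcal{Y}$. The true probability of the next symbol being $x_t$, given $x_{<t}$, is $\mu(x_t|x_{<t})$. The expected loss when predicting $y_t$ is $\mathbf{E}_t[\ell_{x_t y_t}]$. The goal is to minimize the expected loss. More generally we define the $\Lambda_\rho$ prediction scheme

$$y_t^{\Lambda_\rho} := \arg\min_{y_t\in\mathcal{Y}} \sum_{x_t} \rho(x_t|x_{<t})\,\ell_{x_t y_t} \qquad (15)$$

which minimizes the $\rho$-expected loss.⁴ As the true distribution is $\mu$, the actual $\mu$-expected loss when $\Lambda_\rho$ predicts the $t$-th symbol and the total $\mu$-expected loss in the first $n$ predictions are

$$l_t^{\Lambda_\rho}(x_{<t}) := \mathbf{E}_t\,\ell_{x_t y_t^{\Lambda_\rho}}, \qquad L_n^{\Lambda_\rho} := \sum_{t=1}^n \mathbf{E}_{<t}\, l_t^{\Lambda_\rho}(x_{<t}) \qquad (16)$$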
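The decision rule (15) and the one-step loss in (16) can be illustrated with the weather loss table above. The following is a minimal sketch only: the distributions `mu` and `rho` are hypothetical numbers chosen for illustration, and the function names are not from the paper.

```python
# Loss table from the weather example: LOSS[outcome][action].
LOSS = {
    "sunny": {"umbrella": 0.1, "sunglasses": 0.0},
    "rainy": {"umbrella": 0.3, "sunglasses": 1.0},
}
ACTIONS = ("umbrella", "sunglasses")


def expected_loss(rho, action):
    """rho-expected loss of an action: sum over outcomes x of rho(x) * loss[x][action]."""
    return sum(p * LOSS[x][action] for x, p in rho.items())


def bayes_action(rho):
    """Action minimizing the rho-expected loss, as in (15); ties go to the first action."""
    return min(ACTIONS, key=lambda y: expected_loss(rho, y))


def actual_loss(mu, rho):
    """mu-expected loss incurred in one step when acting on the belief rho, as in (16)."""
    return expected_loss(mu, bayes_action(rho))


mu = {"sunny": 0.7, "rainy": 0.3}     # hypothetical true distribution
rho = {"sunny": 0.95, "rainy": 0.05}  # hypothetical overconfident belief

print(bayes_action(mu), actual_loss(mu, mu))    # informed scheme
print(bayes_action(rho), actual_loss(mu, rho))  # scheme based on the wrong belief
```

With these numbers the informed scheme takes the umbrella, while the overconfident belief leads to sunglasses and a larger $\mu$-expected loss, in line with (17).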



Let $\Lambda$ be any (causal) prediction scheme (deterministic or probabilistic does not matter) with no constraint at all, predicting any $y_t^\Lambda\in\mathcal{Y}$ with losses $l_t^\Lambda$ and $L_n^\Lambda$ similarly defined as in (16). If $\mu$ is known, $\Lambda_\mu$ is obviously the best prediction scheme in the sense of achieving minimal expected loss

$$L_n^{\Lambda_\mu} \le L_n^{\Lambda} \quad\text{for any } \Lambda. \qquad (17)$$

⁴ $\arg\min_y(\cdot)$ is defined as the $y$ which minimizes the argument. A tie is broken arbitrarily. In general, the prediction space $\mathcal{Y}$ is allowed to differ from $\mathcal{X}$. If $\mathcal{Y}$ is finite, then $y_t^{\Lambda_\rho}$ always exists. For an infinite action space $\mathcal{Y}$ we assume that a minimizing $y_t^{\Lambda_\rho}\in\mathcal{Y}$ exists, although even this assumption may be removed.



The following loss bound for the universal $\Lambda_\xi$ predictor is proven in [Hut03a]:

$$0 \le L_n^{\Lambda_\xi}-L_n^{\Lambda_\mu} \le D_n+\sqrt{4L_n^{\Lambda_\mu}D_n+D_n^2} \le 2D_n+2\sqrt{L_n^{\Lambda_\mu}D_n}. \qquad (18)$$

The loss bounds have the same form as the error bounds when substituting $S_n\le D_n$ in Theorem 2. For a comparison to Merhav's and Feder's [MF98] loss bound, see [Hut03a]. Replacing $D_n$ by $S_n$ or $V_n$ in (18) gives an invalid bound, so the general bound is slightly weaker. For instance, for $\mathcal{X}=\{0,1\}$, $\ell_{00}=\ell_{11}=0$, $\ell_{10}=1$, $\ell_{01}=c<\frac14$, $\mu(1)=0$, $\nu(1)=2c$, and $w_\mu=w_\nu=\frac12$ we get $\xi(1)=c$, $s_1=2c^2$, $y_1^{\Lambda_\mu}=0$, $l_1^{\Lambda_\mu}=\ell_{00}=0$, $y_1^{\Lambda_\xi}=1$, $l_1^{\Lambda_\xi}=\ell_{01}=c$, hence $L_1^{\Lambda_\xi}-L_1^{\Lambda_\mu}=c \not\le 4c^2=2S_1+2\sqrt{L_1^{\Lambda_\mu}S_1}$. Example loss functions including the absolute, square, logarithmic, and Hellinger loss are discussed in [Hut03a]. Instantaneous error/loss bounds can also be proven:

$$e_t^{\Theta_\xi}(x_{<t})-e_t^{\Theta_\mu}(x_{<t}) \le \sqrt{2s_t(x_{<t})}, \qquad l_t^{\Lambda_\xi}(x_{<t})-l_t^{\Lambda_\mu}(x_{<t}) \le \sqrt{2d_t(x_{<t})}.$$
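The one-step counterexample above can be checked numerically. The sketch below assumes the index convention `loss[(x, y)]` (outcome first, action second) and picks the illustrative value $c=0.1$; the variable names are not from the paper.

```python
import math

c = 0.1  # any 0 < c < 1/4 reproduces the effect
loss = {(0, 0): 0.0, (1, 1): 0.0, (1, 0): 1.0, (0, 1): c}  # loss[(x, y)]
mu = {0: 1.0, 1: 0.0}
nu = {0: 1.0 - 2 * c, 1: 2 * c}
xi = {x: 0.5 * mu[x] + 0.5 * nu[x] for x in (0, 1)}  # mixture with equal weights


def act(rho):
    """Action minimizing the rho-expected loss."""
    return min((0, 1), key=lambda y: sum(rho[x] * loss[(x, y)] for x in (0, 1)))


def mu_loss(y):
    """mu-expected loss of action y."""
    return sum(mu[x] * loss[(x, y)] for x in (0, 1))


regret = mu_loss(act(xi)) - mu_loss(act(mu))
s1 = sum((mu[x] - xi[x]) ** 2 for x in (0, 1))
d1 = sum(mu[x] * math.log(mu[x] / xi[x]) for x in (0, 1) if mu[x] > 0)
loss_mu = mu_loss(act(mu))

bound_with_s = 2 * s1 + 2 * math.sqrt(loss_mu * s1)      # (18) with D replaced by S
bound_with_d = d1 + math.sqrt(4 * loss_mu * d1 + d1 ** 2)  # middle expression of (18)

print(regret, bound_with_s, bound_with_d)
```

The regret exceeds the expression built from $S_1$ but stays below the one built from $D_1$, which is the point of the example.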



4 Application to Games of Chance 

This section applies the loss bounds to games of chance, defined as a sequence of bets, observations, and rewards. After a brief introduction in Section 4.1 we show in Section 4.2 that if there is a profitable scheme at all, asymptotically the universal $\Lambda_\xi$ scheme will also become profitable. We bound the time needed to reach the winning zone. It is proportional to the relative entropy of $\mu$ and $\xi$ with a factor depending on the profit range and the average profit. Section 4.3 presents a numerical example and Section 4.4 attempts to give an information theoretic interpretation of the result.



4.1 Introduction 

Consider investing in the stock market. At time $t$ an amount of money $s_t$ is invested in portfolio $y_t$, where we have access to past knowledge $x_{<t}$ (e.g. charts). After our choice of investment we receive new information $x_t$, and the new portfolio value is $r_t$. The best we can expect is to have a probabilistic model $\mu$ of the behavior of the stock market. The goal is to maximize the net $\mu$-expected profit $p_t=r_t-s_t$. Nobody knows $\mu$, but the assumption of all traders is that there is a computable, profitable $\mu$ they try to find or approximate. From Theorem 1 we know that Levin's universal prior $\xi_U(x_t|x_{<t})$ converges to any computable $\mu(x_t|x_{<t})$ with probability 1. If there is a computable, asymptotically profitable trading scheme at all, the $\Lambda_\xi$ scheme should also be profitable in the long run. To get a practically useful, computable scheme we have to restrict $\mathcal{M}$ to a finite set of computable distributions, e.g. with bounded Levin complexity $Kt$ [LV97]. Although convergence of $\xi$ to $\mu$ is pleasing, what we are really interested in is whether $\Lambda_\xi$ is asymptotically profitable and how long it takes to become profitable. This will be explored in the following.

4.2 Games of Chance 

We use the loss bound (18) to estimate the time needed to reach the winning threshold when using $\Lambda_\xi$ in a game of chance. We assume a game (or a sequence of possibly correlated games) which allows a sequence of bets and observations. In step $t$ we bet, depending on the history $x_{<t}$, a certain amount of money $s_t$, take some action $y_t$, observe outcome $x_t$, and receive reward $r_t$. Our profit, which we want to maximize, is $p_t=r_t-s_t\in[p_{min},p_{max}]$, where $[p_{min},p_{max}]$ is the [minimal, maximal] profit per round and $p_\Delta:=p_{max}-p_{min}$ the profit range. The loss, which we want to minimize, can be defined as the negative scaled profit, $\ell_{x_t y_t}=(p_{max}-p_t)/p_\Delta\in[0,1]$. The probability of outcome $x_t$, possibly depending on the history $x_{<t}$, is $\mu(x_t|x_{<t})$. The total $\mu$-expected profit when using scheme $\Lambda_\rho$ is $P_n^{\Lambda_\rho}=np_{max}-p_\Delta L_n^{\Lambda_\rho}$. If we knew $\mu$, the optimal strategy to maximize our expected profit would be just $\Lambda_\mu$. We assume $P_n^{\Lambda_\mu}>0$ (otherwise there is no winning strategy at all, since $P_n^{\Lambda_\mu}\ge P_n^{\Lambda}$ for all $\Lambda$). Often we are not in the favorable position of knowing $\mu$, but we know (or assume) that $\mu\in\mathcal{M}$ for some $\mathcal{M}$, for instance that $\mu$ is a computable probability distribution. From bound (18) we see that the average profit per round $\bar p_n^{\Lambda_\xi}:=\frac1n P_n^{\Lambda_\xi}$ of the universal $\Lambda_\xi$ scheme converges to the average profit per round $\bar p_n^{\Lambda_\mu}:=\frac1n P_n^{\Lambda_\mu}$ of the optimal informed scheme, i.e. asymptotically we can make the same money even without knowing $\mu$, by just using the universal $\Lambda_\xi$ scheme. Bound (18) allows us to lower bound the universal profit $P_n^{\Lambda_\xi}$:

$$P_n^{\Lambda_\xi} \;\ge\; P_n^{\Lambda_\mu} - p_\Delta D_n - \sqrt{4(np_{max}-P_n^{\Lambda_\mu})\,p_\Delta D_n + p_\Delta^2 D_n^2} \qquad (19)$$

The time needed for $\Lambda_\xi$ to perform well can also be estimated. An interesting quantity is the expected number of rounds needed to reach the winning zone. Using $P_n^{\Lambda_\mu}>0$ one can show that the r.h.s. of (19) is positive if, and only if

$$n \;>\; \frac{2p_\Delta\,(2p_{max}-\bar p_n^{\Lambda_\mu})}{(\bar p_n^{\Lambda_\mu})^2}\cdot D_n \qquad (20)$$
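The equivalence between positivity of the right-hand side of (19) and condition (20) can be checked numerically. This is a rough sketch under simplifying assumptions that are not part of the text: the informed average profit and $D_n$ are treated as constants in $n$, and all numerical values are hypothetical.

```python
import math


def profit_lower_bound(n, avg_profit_mu, p_max, p_delta, d_n):
    """Right-hand side of (19), with the informed total profit written as n * avg_profit_mu."""
    total_mu = n * avg_profit_mu
    root = math.sqrt(4 * (n * p_max - total_mu) * p_delta * d_n + (p_delta * d_n) ** 2)
    return total_mu - p_delta * d_n - root


def rounds_needed(avg_profit_mu, p_max, p_delta, d_n):
    """Right-hand side of (20): the bound (19) is positive exactly when n exceeds this."""
    return 2 * p_delta * (2 * p_max - avg_profit_mu) / avg_profit_mu ** 2 * d_n


# Hypothetical game: profits in [-3, 2], informed average profit 1/3, D_n held at 10 nats.
params = dict(avg_profit_mu=1 / 3, p_max=2.0, p_delta=5.0, d_n=10.0)
threshold = rounds_needed(**params)

for n in (int(threshold) - 100, int(threshold) + 100):
    print(n, profit_lower_bound(n, **params) > 0)
```

Below the threshold the lower bound on the universal profit is negative, above it the bound is positive.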

Theorem 3 (Time to Win) Let there be sequences $x_1x_2\ldots$ over a finite alphabet $\mathcal{X}$ drawn with probability $\mu(x_{1:n})$ for the first $n$ symbols. In step $t$ we make a bet, depending on the history $x_{<t}$, take some action $y_t$, and observe outcome $x_t$. Our net profit is $p_t\in[p_{max}-p_\Delta,\,p_{max}]$. The $\Lambda_\rho$-system (15) acts as to maximize the $\rho$-expected profit. $P_n^{\Lambda_\rho}$ is the total and $\bar p_n^{\Lambda_\rho}=\frac1n P_n^{\Lambda_\rho}$ is the average expected profit of the first $n$ rounds. For the universal $\Lambda_\xi$ and for the optimal informed $\Lambda_\mu$ prediction scheme the following holds:

i) $\bar p_n^{\Lambda_\xi} = \bar p_n^{\Lambda_\mu} - O(n^{-1/2}) \;\to\; \bar p_n^{\Lambda_\mu}$ for $n\to\infty$

ii) $n > \left(\dfrac{2p_\Delta}{\bar p_n^{\Lambda_\mu}}\right)^{2} b_\mu \;\wedge\; \bar p_n^{\Lambda_\mu}>0 \;\Longrightarrow\; \bar p_n^{\Lambda_\xi}>0$

where $b_\mu=\ln w_\mu^{-1}$ with $w_\mu$ being the weight of $\mu$ in $\xi$ in the discrete case (and $b_\mu$ as in Theorem 8 in the continuous case).



By dividing (19) by $n$ and using $D_n\le b_\mu$ we see that the leading order of $\bar p_n^{\Lambda_\xi}-\bar p_n^{\Lambda_\mu}$ is bounded by $\sqrt{4p_\Delta p_{max}b_\mu/n}$, which proves (i). The condition in (ii) is actually a weakening of (20). $P_n^{\Lambda_\xi}$ is trivially positive for $p_{min}>0$, since in this wonderful case all profits are positive. For negative $p_{min}$ the condition of (ii) implies (20), since $p_\Delta>p_{max}$, and (20) implies positive (19), i.e. $P_n^{\Lambda_\xi}>0$, which proves (ii).

If a winning strategy $\Lambda$ with $\bar p_n^{\Lambda}>\varepsilon>0$ exists, then $\Lambda_\xi$ is asymptotically also a winning strategy with the same average profit.



4.3 Example 

Let us consider a game with two dice, one with two black and four white faces, the other with four black and two white faces. The dealer who repeatedly throws the dice uses one or the other die according to some deterministic rule, which correlates the throws (e.g. the first die could be used in round $t$ iff the $t$-th digit of $\pi$ is 7). We can bet on black or white; the stake $s$ is 3\$ in every round; our return $r$ is 5\$ for every correct prediction.

The profit is $p_t=r\,\delta_{x_t y_t}-s$. The coloring of the dice and the selection strategy of the dealer unambiguously determine $\mu$. $\mu(x_t|x_{<t})$ is $\frac13$ or $\frac23$ depending on which die has been chosen. One should bet on the more probable outcome. If we knew $\mu$ the expected profit per round would be $\bar p_n^{\Lambda_\mu}=\frac1n P_n^{\Lambda_\mu}=\frac23 r-s=\frac13\$>0$. If we don't know $\mu$ we should use Levin's universal prior with $D_n\le b_\mu=K(\mu)\cdot\ln 2$, where $K(\mu)$ is the length of the shortest program coding $\mu$ (see Section 2.6). Then we know that betting on the outcome with higher $\xi$ probability leads asymptotically to the same profit (Theorem 3(i)) and $\Lambda_\xi$ reaches the winning threshold no later than $n_{thresh}=900\ln 2\cdot K(\mu)$ (Theorem 3(ii)) or sharper $n_{thresh}=330\ln 2\cdot K(\mu)$ from (20), where $p_{max}=r-s=2\$$ and $p_\Delta=r=5\$$ have been used.
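The two constants 900 and 330 follow directly from the stake, the return and the die probabilities. The short sketch below redoes this arithmetic with exact fractions; the complexity value `k_mu_bits` is a purely hypothetical illustration, since $K(\mu)$ is not computable.

```python
import math
from fractions import Fraction

r, s = 5, 3                      # return and stake in dollars
p_max, p_min = r - s, -s         # best and worst profit per round
p_delta = p_max - p_min          # profit range, equal to r
avg = Fraction(2, 3) * r - s     # expected profit per round when betting on the likelier colour

coef_theorem = (2 * p_delta / avg) ** 2                    # constant in Theorem 3(ii)
coef_sharper = 2 * p_delta * (2 * p_max - avg) / avg ** 2  # constant in condition (20)

k_mu_bits = 10                   # hypothetical complexity K(mu) of the dealer's rule
print(coef_theorem, coef_sharper)
print(float(coef_theorem) * math.log(2) * k_mu_bits,
      float(coef_sharper) * math.log(2) * k_mu_bits)
```

For a rule describable in about ten bits both thresholds land in the range of a few thousand rounds.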

If the die selection strategy reflected in $\mu$ is not too complicated, the $\Lambda_\xi$ prediction system reaches the winning zone after a few thousand rounds. The number of rounds is not really small because the expected profit per round is one order of magnitude smaller than the return. This leads to a constant of two orders of magnitude size in front of $K(\mu)$. Stated otherwise, it is due to the large stochastic noise, which makes it difficult to extract the signal, i.e. the structure of the rule $\mu$ (see next subsection). Furthermore, this is only a bound for the turnaround value $n_{thresh}$. The true expected turnaround $n$ might be smaller. However, for every game for which there exists a computable winning strategy with $\bar p_n>\varepsilon>0$, $\Lambda_\xi$ is guaranteed to get into the winning zone for some $n\sim K(\mu)$.






Marcus Hutter, Technical Report IDSIA-02-02 



4.4 Information-Theoretic Interpretation 

We try to give an intuitive explanation of Theorem 3(ii). We know that $\xi(x_t|x_{<t})$ converges to $\mu(x_t|x_{<t})$ for $t\to\infty$. In a sense $\Lambda_\xi$ learns $\mu$ from past data $x_{<t}$. The information content in $\mu$ relative to $\xi$ is $D_\infty/\ln 2\le b_\mu/\ln 2$. One might think of a Shannon-Fano prefix code of $\nu\in\mathcal{M}$ of length $\lceil b_\nu/\ln 2\rceil$, which exists since the Kraft inequality $\sum_\nu 2^{-\lceil b_\nu/\ln 2\rceil}\le\sum_\nu w_\nu\le 1$ is satisfied. $b_\mu/\ln 2$ bits have to be learned before $\Lambda_\xi$ can be as good as $\Lambda_\mu$. In the worst case, the only information conveyed by $x_t$ is in form of the received profit $p_t$. Remember that we always know the profit $p_t$ before the next cycle starts.
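The Kraft inequality used above is easy to check for a concrete prior. The sketch below assumes a small hypothetical weight vector; the code lengths are $\lceil b_\nu/\ln 2\rceil=\lceil -\log_2 w_\nu\rceil$ bits.

```python
import math


def code_lengths(weights):
    """Shannon-Fano code lengths ceil(-log2 w) in bits, one per environment."""
    return [math.ceil(-math.log2(w)) for w in weights]


def kraft_sum(lengths):
    """Sum of 2^(-length); a prefix code with these lengths exists iff this is at most 1."""
    return sum(2.0 ** -length for length in lengths)


weights = [0.4, 0.25, 0.2, 0.1, 0.05]  # hypothetical prior weights summing to 1
lengths = code_lengths(weights)
print(lengths, kraft_sum(lengths))
```

Each term $2^{-\lceil -\log_2 w_\nu\rceil}$ is at most $w_\nu$, so the Kraft sum cannot exceed the total weight.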

Assume that the distribution of the profits in the interval $[p_{min},p_{max}]$ is mainly due to noise, and there is only a small informative signal of amplitude $\bar p_n^{\Lambda_\mu}$. To reliably determine the sign of a signal of amplitude $\bar p_n^{\Lambda_\mu}$, disturbed by noise of amplitude $p_\Delta$, we have to resubmit a bit $O((p_\Delta/\bar p_n^{\Lambda_\mu})^2)$ times (this reduces the standard deviation below the signal amplitude $\bar p_n^{\Lambda_\mu}$). To learn $\mu$, $b_\mu/\ln 2$ bits have to be transmitted, which requires $n\ge O((p_\Delta/\bar p_n^{\Lambda_\mu})^2)\cdot b_\mu/\ln 2$ cycles. This expression coincides with the condition in (ii). Identifying the signal amplitude with $\bar p_n^{\Lambda_\mu}$ is the weakest part of this consideration, as we have no argument why this should be true. It may be interesting to make the analogy more rigorous, which may also lead to a simpler proof of (ii) not based on bounds (18) with their rather complex proofs.

5 Optimality Properties 

In this section we discuss the quality of the universal predictor and the bounds. In Section 5.1 we show that there are $\mathcal{M}$ and $\mu\in\mathcal{M}$ and weights $w_\nu$ such that the derived error bounds are tight. This shows that the error bounds cannot be improved in general. In Section 5.2 we show Pareto-optimality of $\xi$ in the sense that there is no other predictor which performs at least as well in all environments $\nu\in\mathcal{M}$ and strictly better in at least one. Optimal predictors can always be based on mixture distributions $\xi$. This still leaves open how to choose the weights. In Section 5.3 we give an Occam's razor argument that the choice $w_\nu=2^{-K(\nu)}$, where $K(\nu)$ is the length of the shortest program describing $\nu$, is optimal.

5.1 Lower Error Bound 

We want to show that there exists a class $\mathcal{M}$ of distributions such that any predictor $\Theta$ ignorant of the distribution $\mu\in\mathcal{M}$ from which the observed sequence is sampled must make some minimal additional number of errors as compared to the best informed predictor $\Theta_\mu$.

For deterministic environments a lower bound can easily be obtained by a combinatoric argument. Consider a class $\mathcal{M}$ containing $2^n$ binary sequences such that each prefix of length $n$ occurs exactly once. Assume any deterministic predictor $\Theta$ (not knowing the sequence in advance); then for every prediction $x_t^\Theta$ of $\Theta$ at times $t\le n$ there exists a sequence with opposite symbol $x_t=1-x_t^\Theta$. Hence, $E_\infty^\Theta\ge E_n^\Theta=n=\log_2|\mathcal{M}|$ is a lower worst case bound for every predictor $\Theta$ (this includes $\Theta_\xi$, of course). This shows that the upper bound $E_\infty^{\Theta_\xi}\le\log_2|\mathcal{M}|$ for uniform $w$ obtained in the discussion after Theorem 2 is sharp. In the general probabilistic case we can show by a similar argument that the upper bound of Theorem 2 is sharp for $\Theta_\xi$ and "static" predictors, and sharp within a factor of 2 for general predictors. We do not know whether the factor two gap can be closed.
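The combinatoric argument for deterministic environments can be made concrete: against any deterministic predictor one can construct the length-$n$ sequence on which it errs in every step. The sketch below uses a hypothetical majority-vote predictor as the victim; any other deterministic function of the history would do.

```python
def adversarial_sequence(predictor, n):
    """Build the length-n binary sequence on which `predictor` is wrong in every step."""
    history = []
    for _ in range(n):
        guess = predictor(tuple(history))
        history.append(1 - guess)  # reveal the symbol opposite to the prediction
    return history


def count_errors(predictor, sequence):
    """Number of steps in which the prediction from the past differs from the next symbol."""
    return sum(predictor(tuple(sequence[:t])) != x for t, x in enumerate(sequence))


def majority_predictor(history):
    """Hypothetical deterministic predictor: guess the more frequent past bit (ties give 0)."""
    return int(2 * sum(history) > len(history))


n = 8
worst_case = adversarial_sequence(majority_predictor, n)
print(worst_case, count_errors(majority_predictor, worst_case))
```

Since the class contains every binary prefix of length $n$, the constructed sequence belongs to it, and the predictor makes $n=\log_2|\mathcal{M}|$ errors on it.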

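Theorem 4 (Lower Error Bound) For every $n$ there is an $\mathcal{M}$ and $\mu\in\mathcal{M}$ and weights $w_\nu$ such that

(i) $e_t^{\Theta_\xi}-e_t^{\Theta_\mu}=\sqrt{2s_t}$ and $E_n^{\Theta_\xi}-E_n^{\Theta_\mu}=S_n+\sqrt{4E_n^{\Theta_\mu}S_n+S_n^2}$

where $E_n^{\Theta_\xi}$ and $E_n^{\Theta_\mu}$ are the total expected number of errors of $\Theta_\xi$ and $\Theta_\mu$, and $s_t$ and $S_n$ are the squared distances defined earlier. More generally, the equalities hold for any "static" deterministic predictor $\Theta$ for which $y_t^\Theta$ is independent of $x_{<t}$. For every $n$ and arbitrary deterministic predictor $\Theta$, there exists an $\mathcal{M}$ and $\mu\in\mathcal{M}$ such that

(ii) $e_t^{\Theta}-e_t^{\Theta_\mu}\ge\frac12\sqrt{2s_t(x_{<t})}$ and $E_n^{\Theta}-E_n^{\Theta_\mu}\ge\frac12\left[S_n+\sqrt{4E_n^{\Theta_\mu}S_n+S_n^2}\right]$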

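Proof. (i) The proof parallels and generalizes the deterministic case. Consider a class $\mathcal{M}$ of $2^n$ distributions (over binary alphabet) indexed by $a\equiv a_1\ldots a_n\in\{0,1\}^n$. For each $t$ we want a distribution with posterior probability $\frac12(1+\varepsilon)$ for $x_t=1$ and one with posterior probability $\frac12(1-\varepsilon)$ for $x_t=1$, independent of the past $x_{<t}$, with $0<\varepsilon<1$. That is

$$\mu_a(x_1\ldots x_n) = \mu_{a_1}(x_1)\cdot\ldots\cdot\mu_{a_n}(x_n), \quad\text{where}\quad \mu_{a_t}(x_t) = \begin{cases}\frac12(1+\varepsilon) & \text{for } x_t=a_t\\ \frac12(1-\varepsilon) & \text{for } x_t\neq a_t\end{cases}$$

We are not interested in predictions beyond time $n$, but for completeness we may define $\mu_a$ to assign probability 1 to $x_t=1$ for all $t>n$. If $\mu=\mu_a$, the informed scheme $\Theta_\mu$ always predicts the bit which has highest $\mu$-probability, i.e. $y_t^{\Theta_\mu}=a_t$

$$\Longrightarrow\quad e_t^{\Theta_\mu} = 1-\mu_{a_t}(y_t^{\Theta_\mu}) = \tfrac12(1-\varepsilon) \quad\Longrightarrow\quad E_n^{\Theta_\mu} = \tfrac n2(1-\varepsilon).$$

Since $E_n^{\Theta_\mu}$ is the same for all $a$ we seek to maximize $E_n^\Theta$ for a given predictor $\Theta$ in the following. Assume $\Theta$ predicts $y_t^\Theta$ (independent of history $x_{<t}$). Since we want lower bounds we seek a worst case $\mu$. A success $y_t^\Theta=x_t$ has lowest possible probability $\frac12(1-\varepsilon)$ if $a_t=1-y_t^\Theta$

$$\Longrightarrow\quad e_t^{\Theta} = 1-\mu_{a_t}(y_t^{\Theta}) = \tfrac12(1+\varepsilon) \quad\Longrightarrow\quad E_n^{\Theta} = \tfrac n2(1+\varepsilon).$$

So we have $e_t^\Theta-e_t^{\Theta_\mu}=\varepsilon$ and $E_n^\Theta-E_n^{\Theta_\mu}=n\varepsilon$ for the regrets. We need to eliminate $n$ and $\varepsilon$ in favor of $s_t$, $S_n$, and $E_n^{\Theta_\mu}$. If we assume uniform weights $w_{\mu_a}=2^{-n}$ for all $\mu_a$ we get

$$\xi(x_{1:n}) = \sum_a w_{\mu_a}\mu_a(x_{1:n}) = 2^{-n}\prod_{t=1}^n\sum_{a_t\in\{0,1\}}\mu_{a_t}(x_t) = 2^{-n}\prod_{t=1}^n 1 = 2^{-n},$$

i.e. $\xi$ is an unbiased Bernoulli sequence ($\xi(x_t|x_{<t})=\frac12$)

$$\Longrightarrow\quad s_t(x_{<t}) = \sum_{x_t}\big(\tfrac12-\mu_{a_t}(x_t)\big)^2 = \tfrac12\varepsilon^2 \quad\text{and}\quad S_n=\tfrac n2\varepsilon^2.$$

So we have $\varepsilon=\sqrt{2s_t}$, which proves the instantaneous regret formula $e_t^\Theta-e_t^{\Theta_\mu}=\sqrt{2s_t}$ for static $\Theta$. Inserting $\varepsilon=\sqrt{2S_n/n}$ into $E_n^{\Theta_\mu}$ and solving w.r.t. $\sqrt{2n}$ we get $\sqrt{2n}=\sqrt{S_n}+\sqrt{4E_n^{\Theta_\mu}+S_n}$. So we finally get

$$E_n^\Theta-E_n^{\Theta_\mu} = n\varepsilon = \sqrt{S_n}\sqrt{2n} = S_n+\sqrt{4E_n^{\Theta_\mu}S_n+S_n^2}$$

which proves the total regret formula in (i) for static $\Theta$. We can choose⁵ $y_t^{\Theta_\xi}\equiv 0$ to be a static predictor. Together this shows (i).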
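The elimination of $n$ and $\varepsilon$ in part (i) is pure algebra and can be sanity-checked numerically. The sketch below plugs the closed forms $E_n^{\Theta_\mu}=\frac n2(1-\varepsilon)$, $E_n^\Theta=\frac n2(1+\varepsilon)$ and $S_n=\frac n2\varepsilon^2$ into both sides of the total regret formula; the sample values of $n$ and $\varepsilon$ are arbitrary.

```python
import math


def static_regret_terms(n, eps):
    """Return (regret, closed form) for the worst-case static predictor of the proof."""
    errors_informed = n * (1 - eps) / 2  # E_n for the informed predictor
    errors_static = n * (1 + eps) / 2    # E_n for the static predictor in its worst environment
    s_n = n * eps ** 2 / 2               # total squared distance between mu_a and xi
    regret = errors_static - errors_informed
    closed_form = s_n + math.sqrt(4 * errors_informed * s_n + s_n ** 2)
    return regret, closed_form


for n, eps in [(10, 0.5), (100, 0.1), (1000, 0.01)]:
    print(n, eps, static_regret_terms(n, eps))
```

Both numbers agree up to rounding, because $4E_n^{\Theta_\mu}S_n+S_n^2=n^2\varepsilon^2(1-\varepsilon/2)^2$.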

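(ii) For non-static predictors, $a_t=1-y_t^\Theta$ in the proof of (i) depends on $x_{<t}$, which is not allowed. For general, but fixed $a_t$ we have $e_t^\Theta(x_{<t})=1-\mu_{a_t}(y_t^\Theta)$. This quantity may assume any value between $\frac12(1-\varepsilon)$ and $\frac12(1+\varepsilon)$ when averaged over $x_{<t}$, and is hence of little direct help. But if we additionally average the result over all environments $\mu_a$, we get

$$\langle E_n^\Theta\rangle_a = \Big\langle\sum_{t=1}^n\mathbf{E}[e_t^\Theta(x_{<t})]\Big\rangle_a = \sum_{t=1}^n\mathbf{E}[\langle e_t^\Theta(x_{<t})\rangle_a] = \sum_{t=1}^n\mathbf{E}[\tfrac12] = \tfrac12 n$$

whatever $\Theta$ is chosen: a sort of No-Free-Lunch theorem [WM97], stating that on uniform average all predictors perform equally well/bad. The expectation of $E_n^\Theta$ w.r.t. $a$ can only be $\frac12 n$ if $E_n^\Theta\ge\frac12 n$ for some $a$. Fixing such an $a$ and choosing $\mu=\mu_a$ we get $E_n^\Theta-E_n^{\Theta_\mu}\ge\frac12 n\varepsilon=\frac12\left[S_n+\sqrt{4E_n^{\Theta_\mu}S_n+S_n^2}\right]$, and similarly $e_t^\Theta-e_t^{\Theta_\mu}\ge\frac12\varepsilon=\frac12\sqrt{2s_t(x_{<t})}$.

□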
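The averaging step in part (ii) can be verified by brute force for a small $n$. The sketch below enumerates all $2^n$ environments $\mu_a$ and all sequences, and computes the expected number of errors of a history-dependent predictor; the `copy_last` predictor is a hypothetical example, and $n=4$, $\varepsilon=0.3$ are arbitrary choices.

```python
from itertools import product


def mu_a(a, x, eps):
    """Probability of x_{1:n} under mu_a: each bit equals a_t with probability (1 + eps) / 2."""
    p = 1.0
    for a_t, x_t in zip(a, x):
        p *= (1 + eps) / 2 if x_t == a_t else (1 - eps) / 2
    return p


def expected_errors(predictor, a, eps):
    """Expected number of prediction errors in the first n steps under mu_a."""
    n = len(a)
    total = 0.0
    for x in product((0, 1), repeat=n):
        errors = sum(predictor(x[:t]) != x[t] for t in range(n))
        total += mu_a(a, x, eps) * errors
    return total


def average_over_environments(predictor, n, eps):
    """Uniform average of the expected errors over all 2^n environments."""
    envs = list(product((0, 1), repeat=n))
    return sum(expected_errors(predictor, a, eps) for a in envs) / len(envs)


def copy_last(history):
    """Hypothetical history-dependent predictor: repeat the last bit (0 at the start)."""
    return history[-1] if history else 0


print(average_over_environments(copy_last, 4, 0.3))
```

The uniform average comes out as $n/2$ regardless of the predictor, so at least one environment forces $E_n^\Theta\ge n/2$.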

Since for binary alphabet $s_t=\frac12 a_t^2$, Theorem 4 also holds with $s_t$ replaced by $\frac12 a_t^2$ and $S_n$ replaced by $V_n$. Since $d_t/s_t=1+O(\varepsilon^2)$ we have $D_n/S_n\to 1$ for $\varepsilon\to 0$. Hence the error bound of Theorem 2 with $S_n$ replaced by $D_n$ is asymptotically tight for $E_n^{\Theta_\mu}/D_n\to\infty$ (which implies $\varepsilon\to 0$). This shows that without restrictions on the loss function which exclude the error loss, the loss bound (18) can also not be improved. Note that the bounds are tight even when $\mathcal{M}$ is restricted to Markov or i.i.d. environments, since the presented counterexample is i.i.d.

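A set $\mathcal{M}$ independent of $n$ leading to a good (but not tight) lower bound is $\mathcal{M}=\{\mu_1,\mu_2\}$ with $\mu_{1/2}(1|x_{<t})=\frac12\pm\varepsilon_t$ with $\varepsilon_t=\min\{\frac12,\sqrt{\ln w_{\mu_1}^{-1}}/(\sqrt t\,\ln t)\}$. For $w_{\mu_1}\ll w_{\mu_2}$ and $n\to\infty$ one can show that $E_n^{\Theta_\xi}-E_n^{\Theta_{\mu_1}}\sim\frac{1}{\ln n}\sqrt{E_n^{\Theta_\mu}\ln w_{\mu_1}^{-1}}$.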

Unfortunately there are many important special cases for which the loss bound (18) is not tight. For continuous $\mathcal{Y}$ and logarithmic or quadratic loss function, for instance, one can show that the regret $L_\infty^{\Lambda_\xi}-L_\infty^{\Lambda_\mu}\le\ln w_\mu^{-1}<\infty$ is finite [Hut03a]. For arbitrary loss function, but $\mu$ bounded away from certain critical values, the regret is also finite. For instance, consider the special error-loss, binary alphabet, and $|\mu(x_t|x_{<t})-\frac12|>\varepsilon$ for all $t$ and $x$. $\Theta_\mu$ predicts 0 if $\mu(0|x_{<t})>\frac12$. If also $\xi(0|x_{<t})>\frac12$, then $\Theta_\xi$ makes the same prediction as $\Theta_\mu$; for $\xi(0|x_{<t})<\frac12$ the predictions differ. In the latter case $|\xi(0|x_{<t})-\mu(0|x_{<t})|>\varepsilon$. Conversely for $\mu(0|x_{<t})<\frac12$. So in any case $e_t^{\Theta_\xi}-e_t^{\Theta_\mu}\le\frac{1}{\varepsilon^2}[\xi(x_t|x_{<t})-\mu(x_t|x_{<t})]^2$. Using the definition of $s_t$ and Theorem 1 we see that $E_\infty^{\Theta_\xi}-E_\infty^{\Theta_\mu}\le\frac{1}{2\varepsilon^2}\ln w_\mu^{-1}<\infty$ is finite too. Nevertheless, Theorem 4 is important as it tells us that bound (18) can only be strengthened by making further assumptions on $\ell$ or $\mathcal{M}$.

⁵ This choice may be made unique by slightly non-uniform $w_{\mu_a}=\prod_{t=1}^n[\frac12+(\frac12-a_t)\delta]$ with $\delta\ll 1$.

5.2 Pareto Optimality of $\xi$

In this subsection we want to establish a different kind of optimality property of $\xi$. Let $\mathcal{F}(\mu,\rho)$ be any of the performance measures of $\rho$ relative to $\mu$ considered in the previous sections (e.g. $s_t$, or $D_n$, or $L_n$, ...). It is easy to find $\rho$ more tailored towards $\mu$ such that $\mathcal{F}(\mu,\rho)<\mathcal{F}(\mu,\xi)$. This improvement may be achieved by increasing $w_\mu$, but probably at the expense of increasing $\mathcal{F}$ for other $\nu$, i.e. $\mathcal{F}(\nu,\rho)>\mathcal{F}(\nu,\xi)$ for some $\nu\in\mathcal{M}$. Since we do not know $\mu$ in advance we may ask whether there exists a $\rho$ with better or equal performance for all $\nu\in\mathcal{M}$ and a strictly better performance for one $\nu\in\mathcal{M}$. This would clearly render $\xi$ suboptimal w.r.t. $\mathcal{F}$. We show that there is no such $\rho$ for most performance measures studied in this work.

Definition 5 (Pareto Optimality) Let $\mathcal{F}(\mu,\rho)$ be any performance measure of $\rho$ relative to $\mu$. The universal prior $\xi$ is called Pareto-optimal w.r.t. $\mathcal{F}$ if there is no $\rho$ with $\mathcal{F}(\nu,\rho)\le\mathcal{F}(\nu,\xi)$ for all $\nu\in\mathcal{M}$ and strict inequality for at least one $\nu$.

Theorem 6 (Pareto Optimality) The universal prior $\xi$ is Pareto-optimal w.r.t. the instantaneous and total squared distances $s_t$ and $S_n$, entropy distances $d_t$ and $D_n$, errors $e_t$ and $E_n$, and losses $l_t$ and $L_n$ (16).

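Proof. We first prove Theorem 6 for the instantaneous expected loss $l_t$. We need the more general $\rho$-expected instantaneous losses

$$l_{t\rho}^{\Lambda}(x_{<t}) := \sum_{x_t}\rho(x_t|x_{<t})\,\ell_{x_t y_t^{\Lambda}} \qquad (21)$$

for a predictor $\Lambda$. We want to arrive at a contradiction by assuming that $\xi$ is not Pareto-optimal, i.e. by assuming the existence of a predictor⁶ $\Lambda$ with $l_{t\nu}^{\Lambda}\le l_{t\nu}^{\Lambda_\xi}$ for all $\nu\in\mathcal{M}$ and strict inequality for some $\nu$. Implicit to this assumption is the assumption that $l_{t\nu}^{\Lambda}$ and $l_{t\nu}^{\Lambda_\xi}$ exist. $l_{t\nu}^{\Lambda}$ exists iff $\nu(x_t|x_{<t})$ exists iff $\nu(x_{<t})>0$ iff $w_\nu(x_{<t})>0$.

$$l_{t\xi}^{\Lambda} = \sum_\nu w_\nu(x_{<t})\,l_{t\nu}^{\Lambda} \;<\; \sum_\nu w_\nu(x_{<t})\,l_{t\nu}^{\Lambda_\xi} = l_{t\xi}^{\Lambda_\xi} \;\le\; l_{t\xi}^{\Lambda}$$

The two equalities follow from inserting the posterior mixture $\xi(x_t|x_{<t})=\sum_\nu w_\nu(x_{<t})\,\nu(x_t|x_{<t})$ into (21). The strict inequality follows from the assumption and $w_\nu(x_{<t})>0$. The last inequality follows from the fact that $\Lambda_\xi$ minimizes by definition (15) the $\xi$-expected loss (similarly to (17)). The contradiction $l_{t\xi}^{\Lambda}<l_{t\xi}^{\Lambda}$ proves Pareto-optimality of $\xi$ w.r.t. $l_t$.

⁶ According to Definition 5 we should look for a $\rho$, but for each deterministic predictor $\Lambda$ there exists a $\rho$ with $\Lambda=\Lambda_\rho$.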

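In the same way we can prove Pareto-optimality of $\xi$ w.r.t. the total loss $L_n$ by defining the $\rho$-expected total losses

$$L_{n\rho}^{\Lambda} := \sum_{t=1}^n\sum_{x_{<t}}\rho(x_{<t})\,l_{t\rho}^{\Lambda}(x_{<t}) = \sum_{t=1}^n\sum_{x_{1:t}}\rho(x_{1:t})\,\ell_{x_t y_t^{\Lambda}}$$

for a predictor $\Lambda$, and by assuming $L_{n\nu}^{\Lambda}\le L_{n\nu}^{\Lambda_\xi}$ for all $\nu$ and strict inequality for some $\nu$, from which we get the contradiction $L_{n\xi}^{\Lambda}=\sum_\nu w_\nu L_{n\nu}^{\Lambda}<\sum_\nu w_\nu L_{n\nu}^{\Lambda_\xi}=L_{n\xi}^{\Lambda_\xi}\le L_{n\xi}^{\Lambda}$ with the help of the mixture representation $\xi=\sum_\nu w_\nu\nu$. The instantaneous and total expected errors $e_t$ and $E_n$ can be considered as special loss functions.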

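Pareto-optimality of $\xi$ w.r.t. $s_t$ (and hence $S_n$) can be understood from geometrical insight. A formal proof for $s_t$ goes as follows: With the abbreviations $i=x_t$, $y_{\nu i}=\nu(x_t|x_{<t})$, $z_i=\xi(x_t|x_{<t})$, $r_i=\rho(x_t|x_{<t})$, and $w_\nu=w_\nu(x_{<t})>0$ we ask for a vector $r$ with $\sum_i(y_{\nu i}-r_i)^2\le\sum_i(y_{\nu i}-z_i)^2\ \forall\nu$. This implies

$$0 \;\ge\; \sum_\nu w_\nu\Big[\sum_i(y_{\nu i}-r_i)^2-\sum_i(y_{\nu i}-z_i)^2\Big] = \sum_\nu w_\nu\Big[\sum_i -2y_{\nu i}r_i+r_i^2+2y_{\nu i}z_i-z_i^2\Big]$$

$$= \sum_i -2z_ir_i+r_i^2+2z_i^2-z_i^2 = \sum_i(r_i-z_i)^2 \;\ge\; 0$$

where we have used $\sum_\nu w_\nu=1$ and $\sum_\nu w_\nu y_{\nu i}=z_i$. $0\ge\sum_i(r_i-z_i)^2\ge 0$ implies $r=z$, proving unique Pareto-optimality of $\xi$ w.r.t. $s_t$. Similarly for $d_t$ the assumption $\sum_i y_{\nu i}\ln\frac{y_{\nu i}}{r_i}\le\sum_i y_{\nu i}\ln\frac{y_{\nu i}}{z_i}\ \forall\nu$ implies

$$0 \;\ge\; \sum_\nu w_\nu\Big[\sum_i y_{\nu i}\ln\frac{y_{\nu i}}{r_i}-\sum_i y_{\nu i}\ln\frac{y_{\nu i}}{z_i}\Big] = \sum_\nu w_\nu\sum_i y_{\nu i}\ln\frac{z_i}{r_i} = \sum_i z_i\ln\frac{z_i}{r_i} \;\ge\; 0$$

which implies $r=z$, proving unique Pareto-optimality of $\xi$ w.r.t. $d_t$. The proofs for $S_n$ and $D_n$ are similar. □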
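The key identity in the proof for $s_t$, namely that the $w$-weighted difference of squared distances collapses to $\sum_i(r_i-z_i)^2$, can be checked on a small example. The sketch below uses hypothetical weights, environments and competitor $r$; only the identity itself comes from the proof.

```python
def pareto_gap_squared(weights, envs, r):
    """Return (weighted difference of squared distances, squared distance between r and z)."""
    size = len(r)
    # z is the mixture prediction: z_i = sum over nu of w_nu * y_{nu i}
    z = [sum(w * y[i] for w, y in zip(weights, envs)) for i in range(size)]
    lhs = sum(
        w * (sum((y[i] - r[i]) ** 2 for i in range(size))
             - sum((y[i] - z[i]) ** 2 for i in range(size)))
        for w, y in zip(weights, envs)
    )
    rhs = sum((r[i] - z[i]) ** 2 for i in range(size))
    return lhs, rhs


weights = [0.5, 0.3, 0.2]                      # hypothetical posterior weights, summing to 1
envs = [[0.2, 0.8], [0.6, 0.4], [0.9, 0.1]]    # hypothetical predictive distributions y_nu
competitor = [0.5, 0.5]                        # hypothetical alternative prediction r

print(pareto_gap_squared(weights, envs, competitor))
```

Because the right-hand side is strictly positive whenever $r\neq z$, no competitor can be at least as close as $\xi$ to every environment at once.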

We have proven that $\xi$ is uniquely Pareto-optimal w.r.t. $s_t$, $S_n$, $d_t$ and $D_n$. In the case of $e_t$, $l_t$ and $L_n$ there are other $\rho\neq\xi$ with $\mathcal{F}(\nu,\rho)=\mathcal{F}(\nu,\xi)\ \forall\nu$, but the actions/predictions they invoke are unique ($y_t^{\Lambda_\rho}=y_t^{\Lambda_\xi}$) (if ties in the $\arg\min_{y_t}$ of (15) are broken in a consistent way), and this is all that counts.

Note that $\xi$ is not Pareto-optimal w.r.t. all performance measures. Counterexamples can be given for $\mathcal{F}(\nu,\xi)=\sum_{x_t}|\nu(x_t|x_{<t})-\xi(x_t|x_{<t})|^\alpha$ for $\alpha\neq 2$, e.g. $a_t$ and $V_n$. Nevertheless, for all performance measures which are relevant from a decision theoretic point of view, i.e. for all loss functions $l_t$ and $L_n$, $\xi$ has the welcome property of being Pareto-optimal.

Pareto-optimality should be regarded as a necessary condition for a prediction scheme aiming to be optimal. From a practical point of view a significant decrease of $\mathcal{F}$ for many $\nu$ may be desirable even if this causes a small increase of $\mathcal{F}$ for a few other $\nu$. One can show that such a "balanced" improvement is (not) possible in the following sense: For instance, by using $\Lambda$ instead of $\Lambda_\xi$, the $\nu$-expected loss may increase or decrease, i.e. $L_{n\nu}^{\Lambda}\gtrless L_{n\nu}^{\Lambda_\xi}$, but on average, the loss cannot decrease, since $\sum_\nu w_\nu[L_{n\nu}^{\Lambda}-L_{n\nu}^{\Lambda_\xi}]=L_{n\xi}^{\Lambda}-L_{n\xi}^{\Lambda_\xi}\ge 0$, where we have used linearity of $L_{n\rho}$ in $\rho$ and $L_{n\xi}^{\Lambda_\xi}\le L_{n\xi}^{\Lambda}$. In particular, a loss increase by an amount $\Delta_\lambda$ in only a single environment $\lambda$ can cause a decrease by at most the same amount times a factor $w_\lambda/w_\eta$ in some other environment $\eta$, i.e. a loss increase can only cause a smaller decrease in simpler environments, but a scaled decrease in more complex environments. We do not regard this as a "No Free Lunch" (NFL) theorem [WM97]. Since most environments are completely random, a small concession on the loss in each of these completely uninteresting environments provides enough margin to yield distinguished performance on the few non-random (interesting) environments. Indeed, we would interpret the NFL theorems for optimization and search in [WM97] as balanced Pareto-optimality results. Interestingly, whereas for prediction only Bayes-mixes are Pareto-optimal, for search and optimization every algorithm is Pareto-optimal.
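The balanced trade-off can be seen in a single decision step. The sketch below reuses the weather loss table with two hypothetical environments named `lambda` and `eta` and equal hypothetical weights; it compares the mixture-based action with the only alternative action.

```python
# LOSS[x][y]: outcome x in (sunny, rainy), action y in (umbrella, sunglasses)
LOSS = [[0.1, 0.0], [0.3, 1.0]]
envs = {"lambda": [0.5, 0.5], "eta": [0.9, 0.1]}  # hypothetical nu(x)
weights = {"lambda": 0.5, "eta": 0.5}             # hypothetical prior weights


def exp_loss(nu, y):
    """nu-expected loss of action y."""
    return sum(nu[x] * LOSS[x][y] for x in range(2))


xi = [sum(weights[k] * envs[k][x] for k in envs) for x in range(2)]
y_xi = min(range(2), key=lambda y: exp_loss(xi, y))  # action of the mixture-based scheme
y_alt = 1 - y_xi                                     # the only alternative action

# Loss change per environment when switching from the mixture action to the alternative.
delta = {k: exp_loss(envs[k], y_alt) - exp_loss(envs[k], y_xi) for k in envs}
weighted_change = sum(weights[k] * delta[k] for k in envs)

print(delta, weighted_change)
```

Switching actions lowers the loss in environment `eta` but raises it in `lambda`; the weighted change is non-negative, and the decrease in `eta` is bounded by $w_\lambda/w_\eta$ times the increase in `lambda`.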

The term Pareto-optimal has been taken from the economics literature, but there is the closely related notion of unimprovable strategies [BM98] or admissible estimators [Fer67] in statistics for parameter estimation, for which results similar to Theorem 6 exist. Furthermore, it would be interesting to show under which conditions the class of all Bayes-mixtures (i.e. with all possible values for the weights) is complete in the sense that every Pareto-optimal strategy can be based on a Bayes-mixture. Pareto-optimality is sort of a minimal demand on a prediction scheme aiming to be optimal. A scheme which is not even Pareto-optimal cannot be regarded as optimal in any reasonable sense. Pareto-optimality of $\xi$ w.r.t. most performance measures emphasizes the distinctiveness of Bayes-mixture strategies.



5.3 On the Optimal Choice of Weights 

In the following we indicate the dependency of $\xi$ on $w$ explicitly by writing $\xi_w$. We have shown that the $\Lambda_{\xi_w}$ prediction schemes are (balanced) Pareto-optimal, i.e. that no prediction scheme $\Lambda$ (whether based on a Bayes mix or not) can be uniformly better. Least assumptions on the environment are made for $\mathcal{M}$ which are as large as possible. In Section 2.6 we have discussed the set $\mathcal{M}$ of all enumerable semimeasures, which we regarded as sufficiently large from a computational point of view (see [Sch02a, Hut03b] for even larger sets, but which are still in the computational realm). Agreeing on this $\mathcal{M}$ still leaves open the question of how to choose the weights (prior beliefs) $w_\nu$, since every $\xi_w$ with $w_\nu>0\ \forall\nu$ is Pareto-optimal and leads asymptotically to optimal predictions.

We have derived bounds for the mean squared sum $S_{n\nu}^{\xi_w}\le\ln w_\nu^{-1}$ and for the loss regret $L_{n\nu}^{\Lambda_{\xi_w}}-L_{n\nu}^{\Lambda_\nu}\le 2\ln w_\nu^{-1}+2\sqrt{\ln w_\nu^{-1}L_{n\nu}^{\Lambda_\nu}}$. All bounds monotonically decrease with increasing $w_\nu$. So it is desirable to assign high weights to all $\nu\in\mathcal{M}$. Due to the (semi)probability constraint $\sum_\nu w_\nu\le 1$ one has to find a compromise.⁷ In the following we will argue that in the class of enumerable weight functions with short program there is an optimal compromise, namely $w_\nu=2^{-K(\nu)}$.

Consider the class of enumerable weight functions with short programs, namely $\mathcal{V}:=\{v_{(\cdot)}:\mathcal{M}\to\mathbb{R}^+$ with $\sum_\nu v_\nu\le 1$ and $K(v)=O(1)\}$. Let $w_\nu:=2^{-K(\nu)}$ and $v_{(\cdot)}\in\mathcal{V}$. Corollary 4.3.1 of [LV97, p255] says that $K(x)\le-\log_2P(x)+K(P)+O(1)$ for all $x$ if $P$ is an enumerable discrete semimeasure. Identifying $P$ with $v$ and $x$ with (the program index describing) $\nu$ we get

$$\ln w_\nu^{-1} \;\le\; \ln v_\nu^{-1} + O(1).$$

This means that the bounds for $\xi_w$ depending on $\ln w_\nu^{-1}$ are at most $O(1)$ larger than the bounds for $\xi_v$ depending on $\ln v_\nu^{-1}$. So we lose at most an additive constant of order one in the bounds when using $\xi_w$ instead of $\xi_v$. In using $\xi_w$ we are on the safe side, getting (within $O(1)$) best bounds for all environments.

Theorem 7 (Optimality of universal weights) Within the set $\mathcal{V}$ of enumerable weight functions with short program, the universal weights $w_\nu=2^{-K(\nu)}$ lead to the smallest loss bounds within an additive (to $\ln w_\mu^{-1}$) constant in all enumerable environments.

Since the above justifies the use of $\xi_w$, and $\xi_w$ assigns high probability to an environment if and only if it has low (Kolmogorov) complexity, one may interpret the result as a justification of Occam's razor.⁸ But note that this is more of a bootstrap argument, since we implicitly used Occam's razor to justify the restriction to enumerable semimeasures. We also considered only weight functions $v$ with low complexity $K(v)=O(1)$. What did not enter as an assumption but came out as a result is that the specific universal weights $w_\nu=2^{-K(\nu)}$ are optimal.

On the other hand, this choice for $w_\nu$ is not unique (not even within a constant factor). For instance, for $0<v_\nu=O(1)$ for $\nu=\xi_w$ and $v_\nu$ arbitrary (e.g. 0) for all other $\nu$, the obvious dominance $\xi_v\ge v_\nu\nu$ can be improved to $\xi_v\ge c\cdot w_\nu\nu$, where $0<c=O(1)$ is a universal constant. Indeed, formally every choice of weights $v_\nu>0\ \forall\nu$ leads within a multiplicative constant to the same universal distribution, but this constant is not necessarily of "acceptable" size. Details will be presented elsewhere.

⁷ All results in this paper have been stated and proven for probability measures $\mu$, $\xi$ and $w_\nu$, i.e. $\sum_{x_{1:t}}\xi(x_{1:t})=\sum_{x_{1:t}}\mu(x_{1:t})=\sum_\nu w_\nu=1$. On the other hand, the class $\mathcal{M}$ considered here is the class of all enumerable semimeasures and $\sum_\nu w_\nu<1$. In general, each of the following 4 items could be semi (<) or not (=): $(\xi,\mu,\mathcal{M},w_\nu)$, where $\mathcal{M}$ is semi if some elements are semi. Six out of the $2^4$ combinations make sense. Convergence (Theorem 1), the error bound (Theorem 2), the loss bound (18), as well as most other statements hold for (<,=,<,<), but not for (<,<,<,<). Nevertheless, $\xi\to\mu$ holds also for (<,<,<,<) with maximal $\mu$ semi-probability, i.e. fails with $\mu$ semi-probability 0.

⁸ The only if direction can be shown by a more easy and direct argument [Sch02a].
6 Miscellaneous

This section discusses miscellaneous topics. Section 6.1 generalizes the setup to continuous probability classes $\mathcal{M}=\{\mu_\theta\}$ consisting of continuously parameterized distributions $\mu_\theta$ with parameter $\theta\in\mathbb{R}^d$. Under certain smoothness and regularity conditions a bound for the relative entropy between $\mu$ and $\xi$, which is central for all presented results, can still be derived. The bound depends on the Fisher information of $\mu$ and grows only logarithmically with $n$, the intuitive reason being the necessity to describe $\theta$ to an accuracy $O(n^{-1/2})$. Section 6.2 describes two ways of using the prediction schemes for partial sequence prediction, where not every symbol needs to be predicted. Performing and predicting a sequence of independent experiments and online learning of classification tasks are special cases. In Section 6.3 we compare the universal prediction scheme studied here to the popular predictors based on expert advice (PEA) [LW89, Vov92, LW94, CB97, HKW98, KW99]. Although the algorithms, the settings, and the proofs are quite different, the PEA bounds and our error bound have the same structure. Finally, in Section 6.4 we outline possible extensions of the presented theory and results, including infinite alphabets, delayed and probabilistic prediction, active systems influencing the environment, learning aspects, and a unification with PEA.

6.1 Continuous Probability Classes A4 

We have considered thus far countable probability classes Ai, which makes sense 
from a computational point of view as emphasized in Section ESI On the other hand 
in statistical parameter estimation one often has a continuous hypothesis class (e.g. 
a Bernoulli(6') process with unknown 6e [0,1]). Let 



be a family of probability distributions parameterized by a dimensional continuous 
parameter 9. Let fi = fig^ eA4 he the true generating distribution and 6^0 be in the 
interior of the compact set 9. We may restrict Ai to a countable dense subset, like 
{fj^e} with computable (or rational) 6. If is itself a computable real (or rational) 
vector then Theorem Q and bound (|18p apply. From a practical point of view the 
assumption of a computable 6o is not so serious. It is more from a traditional analysis 
point of view that one would like quantities and results depending smoothly on 6 
and not in a weird fashion depending on the computational complexity of 6. For 
instance, the weight w{9) is often a continuous probability density 



The most important property of ^ used in this work was ^(xi:„) > Wi,-z/(xi:„) which 
has been obtained from by dropping the sum over u. The analogous construction 
here is to restrict the integral over B to a small vicinity Ns of 6. For sufficiently 



M := {fie-OeQC R'^} 





(22) 
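smooth $\mu_\theta$ and $w(\theta)$ we expect $\xi(x_{1:n}) \gtrsim |N_{\delta_n}|\cdot w(\theta)\cdot\mu_\theta(x_{1:n})$, where $|N_{\delta_n}|$ is the volume
of $N_{\delta_n}$. This in turn leads to $D_n \lesssim \ln w_\mu^{-1} + \ln|N_{\delta_n}|^{-1}$, where $w_\mu := w(\theta_0)$. $N_{\delta_n}$ should
be the largest possible region in which $\ln\mu_\theta$ is approximately flat on average. The
averaged instantaneous, mean, and total curvature matrices of $\ln\mu$ are

$$j_t(x_{<t}) := \mathbf{E}_t\, \nabla_\theta\ln\mu_\theta(x_t|x_{<t})\,\nabla_\theta^T\ln\mu_\theta(x_t|x_{<t})\big|_{\theta=\theta_0}, \qquad \bar{j}_n := \tfrac{1}{n}J_n,$$

$$J_n := \sum_{t=1}^{n}\mathbf{E}_{<t}\, j_t(x_{<t}) = \mathbf{E}_{1:n}\, \nabla_\theta\ln\mu_\theta(x_{1:n})\,\nabla_\theta^T\ln\mu_\theta(x_{1:n})\big|_{\theta=\theta_0}.$$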

They are the Fisher information of $\mu$ and may be viewed as measures of the
parametric complexity of $\mu_\theta$ at $\theta=\theta_0$. The last equality can be shown by using the
fact that the $\mu$-expected value of $\nabla\ln\mu\cdot\nabla^T\ln\mu$ coincides with the $\mu$-expected value
of $-\nabla\nabla^T\ln\mu$ (since $\mathcal{X}$ is finite) and a similar equality as in ^ for $D_n$.
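
The coincidence of the two forms of the Fisher information can be checked by hand for the simplest case. The following sketch does so for a single Bernoulli($\theta$) observation, using exact rational arithmetic. The function names and the choice $\theta_0 = 3/10$ are illustrative assumptions, not part of the paper.

```python
from fractions import Fraction

def score(x, theta):
    """First derivative of ln mu_theta(x), with mu_theta(1)=theta, mu_theta(0)=1-theta."""
    return Fraction(1) / theta if x == 1 else Fraction(-1) / (1 - theta)

def hessian(x, theta):
    """Second derivative of ln mu_theta(x) with respect to theta."""
    return Fraction(-1) / theta**2 if x == 1 else Fraction(-1) / (1 - theta)**2

def expect(f, theta):
    """mu_theta-expectation of f(x) over the finite alphabet {0, 1}."""
    return (1 - theta) * f(0) + theta * f(1)

theta0 = Fraction(3, 10)  # illustrative parameter value
outer_form = expect(lambda x: score(x, theta0) ** 2, theta0)       # E[(d ln mu)^2]
curvature_form = -expect(lambda x: hessian(x, theta0), theta0)     # -E[d^2 ln mu]
print(outer_form, curvature_form)
```

Both expressions evaluate to $1/(\theta_0(1-\theta_0))$, the familiar Fisher information of a Bernoulli process.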

Theorem 8 (Continuous Entropy Bound) Let $\mu_\theta$ be twice continuously
differentiable at $\theta_0\in\Theta\subseteq\mathbb{R}^d$ and $w(\theta)$ be continuous and positive at $\theta_0$. Furthermore
we assume that the inverse of the mean Fisher information matrix $(\bar{j}_n)^{-1}$ exists, is
bounded for $n\to\infty$, and is uniformly (in $n$) continuous at $\theta_0$. Then the relative
entropy $D_n$ between $\mu\equiv\mu_{\theta_0}$ and $\xi$ (defined in (22)) can be bounded by

$$D_n := \mathbf{E}_{1:n}\ln\frac{\mu(x_{1:n})}{\xi(x_{1:n})} \;\le\; \ln w_\mu^{-1} + \frac{d}{2}\ln\frac{n}{2\pi} + \frac{1}{2}\ln\det\bar{j}_n + o(1) \;=:\; b_\mu,$$

where $w_\mu\equiv w(\theta_0)$ is the weight density of $\xi$ in (22) and $o(1)$ tends to zero for
$n\to\infty$.



Proof sketch. For independent and identically distributed distributions $\mu_\theta(x_{1:n}) =
\mu_\theta(x_1)\cdot\ldots\cdot\mu_\theta(x_n)$ $\forall\theta$ this bound has been proven in [CB90, Theorem 2.3]. In this case
$J^{[CB90]}(\theta_0)\equiv\bar{j}_n\equiv j_n$ independent of $n$. For stationary ($k^{th}$-order) Markov processes
$\bar{j}_n$ is also constant. The proof generalizes to arbitrary $\mu_\theta$ by replacing $J^{[CB90]}(\theta_0)$
with $\bar{j}_n$ everywhere in their proof. For the proof to go through, the vicinity $N_{\delta_n} :=
\{\theta : \|\theta-\theta_0\|_{\bar{j}_n}\le\delta_n\}$ of $\theta_0$ must contract to a point set $\{\theta_0\}$ for $n\to\infty$ and $\delta_n\to 0$. $\bar{j}_n$
is always positive semi-definite as can be seen from the definition. The boundedness
condition of $\bar{j}_n^{-1}$ implies a strictly positive lower bound independent of $n$ on the
eigenvalues of $\bar{j}_n$ for all sufficiently large $n$, which ensures $N_{\delta_n}\to\{\theta_0\}$. The uniform
continuity of $\bar{j}_n$ ensures that the remainder $o(1)$ from the Taylor expansion of $D_n$
is independent of $n$. Note that twice continuous differentiability of $D_n$ at $\theta_0$ [CB90,
Condition 2] follows for finite $\mathcal{X}$ from twice continuous differentiability of $\mu_\theta$. Under
some additional technical conditions one can even prove an equality $D_n = \ln w_\mu^{-1} +
\frac{d}{2}\ln\frac{n}{2\pi e} + \frac{1}{2}\ln\det\bar{j}_n + o(1)$ for the i.i.d. case [CB90, (1.4)], which is probably also valid
for general $\mu_\theta$. □
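
To make the bound of Theorem 8 tangible, the following sketch evaluates $D_n$ exactly for a Bernoulli($\theta_0$) source and compares it with $b_\mu$. It assumes a uniform weight density $w(\theta)=1$ on $[0,1]$, for which the mixture has the closed form $\xi(x_{1:n}) = 1/((n+1)\binom{n}{k})$ with $k$ the number of ones, and it uses the Bernoulli Fisher information $1/(\theta_0(1-\theta_0))$. All function names and the values $n=1000$, $\theta_0=0.3$ are illustrative choices.

```python
import math

def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def relative_entropy_bernoulli(n, theta0):
    """D_n = E[ln mu(x_1:n)/xi(x_1:n)] for Bernoulli(theta0) and uniform prior w = 1."""
    total = 0.0
    for k in range(n + 1):
        # log-probability of one particular sequence with k ones under mu
        log_mu = k * math.log(theta0) + (n - k) * math.log(1 - theta0)
        # uniform-prior mixture: xi = k!(n-k)!/(n+1)! = 1/((n+1) C(n,k))
        log_xi = -math.log(n + 1) - log_binom(n, k)
        prob_k = math.exp(log_binom(n, k) + log_mu)  # probability of seeing k ones
        total += prob_k * (log_mu - log_xi)
    return total

def entropy_bound(n, theta0, w0=1.0, d=1):
    """b_mu without the o(1) term: ln(1/w) + (d/2) ln(n/2pi) + (1/2) ln det j."""
    fisher = 1.0 / (theta0 * (1 - theta0))
    return math.log(1 / w0) + 0.5 * d * math.log(n / (2 * math.pi)) + 0.5 * math.log(fisher)

d_n = relative_entropy_bernoulli(1000, 0.3)
b_mu = entropy_bound(1000, 0.3)
print(d_n, b_mu)
```

In this example $D_n$ stays below $b_\mu$ and lies close to $b_\mu - \frac{1}{2}$, in line with the equality quoted from [CB90, (1.4)] for $d=1$.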

The $\ln w_\mu^{-1}$ part in the bound is the same as for countable $\mathcal{M}$. The $\frac{d}{2}\ln\frac{n}{2\pi}$ can
be understood as follows: Consider $\theta\in[0,1)$ and restrict the continuous $\mathcal{M}$ to $\theta$
which are finite binary fractions. Assign a weight $w(\theta)\approx 2^{-l}$ to a $\theta$ with binary
representation of length $l$. $D_n\le l\cdot\ln 2$ in this case. But what if $\theta$ is not a finite
binary fraction? A continuous parameter can typically be estimated with accuracy
$O(n^{-1/2})$ after $n$ observations. The data do not allow us to distinguish a $\tilde\theta$ from the
true $\theta$ if $|\tilde\theta-\theta| < O(n^{-1/2})$. There is such a $\tilde\theta$ with binary representation of length
$l=\log_2 O(\sqrt{n})$. Hence we expect $D_n\lesssim\frac{1}{2}\ln n+O(1)$, or $\frac{d}{2}\ln n+O(1)$ for a $d$-dimensional
parameter space. In general, the $O(1)$ term depends on the parametric complexity of
$\mu_\theta$ and is explicated by the third $\frac{1}{2}\ln\det\bar{j}_n$ term in Theorem 8. See [CB90, p454] for
an alternative explanation. Note that a uniform weight $w(\theta)=\frac{1}{|\Theta|}$ does not lead to a
uniform bound unlike the discrete case. A uniform bound is obtained for Bernardo's
(or in the scalar case Jeffreys') reference prior $w(\theta)\propto\sqrt{\det\bar{j}_\infty(\theta)}$ if $\bar{j}_\infty$ exists [Ris96].
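
The binary-fraction argument can be illustrated numerically. The sketch below searches for the shortest binary fraction within $n^{-1/2}$ of a given $\theta$ and compares the resulting cost $l\cdot\ln 2$ with $\frac{1}{2}\ln n$. The helper names, the tolerance $n^{-1/2}$ with constant 1, and the values $\theta=1/3$, $n=10^6$ are assumptions made for illustration only.

```python
import math

def shortest_binary_fraction(theta, n):
    """Smallest length l such that some k/2^l lies within n^(-1/2) of theta."""
    tol = 1.0 / math.sqrt(n)
    l = 0
    # round(theta * 2^l) / 2^l is the nearest binary fraction of length l
    while abs(round(theta * 2**l) / 2**l - theta) >= tol:
        l += 1
    return l

def discretization_cost(theta, n):
    """Return (l, l*ln 2, (1/2) ln n) for the weight w ~ 2^(-l) heuristic."""
    l = shortest_binary_fraction(theta, n)
    return l, l * math.log(2), 0.5 * math.log(n)

l, cost, half_log_n = discretization_cost(1 / 3, 10**6)
print(l, cost, half_log_n)
```

The length found never exceeds $\lceil\log_2\sqrt{n}\rceil$, so the cost is at most $\frac{1}{2}\ln n + \ln 2$, which matches the $\frac{1}{2}\ln n + O(1)$ heuristic in the text.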

For a finite alphabet X we consider throughout the paper, jf^ < oo independent 
of $t$ and $x_{<t}$ in case of i.i.d. sequences. More generally, the conditions of Theorem
8 are satisfied for the practically very important class of stationary ($k$-th order)
finite-state Markov processes ($k=0$ is i.i.d.).

Theorem 8 shows that Theorems ^ and |21 are also applicable to the case of
continuously parameterized probability classes. Theorem 8 is also valid for a mixture
of the discrete and continuous cases $\xi = \sum_a\int d\theta\, w^a(\theta)\,\mu^a_\theta$ with $\sum_a\int d\theta\, w^a(\theta) = 1$.

6.2 Further Applications 

Partial sequence prediction. There are (at least) two ways to treat partial
sequence prediction. By this we mean that not every symbol of the sequence needs
to be predicted; say, given sequences of the form $z_1x_1...z_nx_n$ we want to predict the
$x$'s only. The first way is to keep the $\Lambda_\rho$ prediction schemes of the last sections
mainly as they are, and use a time dependent loss function, which assigns zero loss
$\ell^t_{zy}\equiv 0$ at the $z$ positions. Any dummy prediction $y$ is then consistent with (fT3|). The
losses for predicting $x$ are generally non-zero. This solution is satisfactory as long as
the $z$'s are drawn from a probability distribution. The second (preferable) way does
not rely on a probability distribution over the $z$. We replace all distributions $\rho(x_{1:n})$
($\rho=\mu,\nu,\xi$) everywhere by distributions $\rho(x_{1:n}|z_{1:n})$ conditioned on $z_{1:n}$. The $z_{1:n}$
conditions cause no problems anywhere, as they can essentially be thought of as fixed (or
as oracles or spectators). So the bounds in Theorems ^.jH] also hold in this case for
all individual $z$'s.

Independent experiments and classification. A typical experimental situation
is a sequence of independent (i.i.d.) experiments, predictions and observations. At
time $t$ one arranges an experiment $z_t$ (or observes data $z_t$), then tries to make a
prediction, and finally observes the true outcome $x_t$. Often one has a parameterized
class of models (hypothesis space) $\mu_\theta(x_t|z_t)$ and wants to infer the true $\theta$ in order
to make improved predictions. This is a special case of partial sequence prediction,
where the hypothesis space $\mathcal{M} = \{\mu_\theta(x_{1:n}|z_{1:n}) = \mu_\theta(x_1|z_1)\cdot\ldots\cdot\mu_\theta(x_n|z_n)\}$ consists of
i.i.d. distributions, but note that $\xi$ is not i.i.d. This is the same setting as for on-line
learning of classification tasks, where a $z\in\mathcal{Z}$ should be assigned a class in $\mathcal{X}$.
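
The remark that $\xi$ is not i.i.d. even though every $\mu_\theta$ is can be seen in a small on-line classification sketch. It assumes a hypothetical three-element class in which $\mu_\theta(x|z)$ says that the label $x$ equals the input $z$ with probability $\theta$, uniform initial weights, and a fixed artificial data stream. None of these choices come from the paper.

```python
def likelihood(theta, x, z):
    """Hypothetical model mu_theta(x|z): the label x equals z with probability theta."""
    return theta if x == z else 1.0 - theta

def mixture_predict(weights, thetas, x, z):
    """xi(x | z, history): average of the models under the current posterior weights."""
    return sum(w * likelihood(th, x, z) for w, th in zip(weights, thetas))

def posterior_update(weights, thetas, x, z):
    """Bayes rule: reweight each model by the probability it gave to the observed x."""
    new = [w * likelihood(th, x, z) for w, th in zip(weights, thetas)]
    total = sum(new)
    return [w / total for w in new]

thetas = [0.1, 0.5, 0.9]       # illustrative hypothesis class
weights = [1 / 3] * 3          # uniform prior weights
# artificial stream of (z_t, x_t): x_t = z_t except at every tenth step
data = [(t % 2, t % 2 if t % 10 else 1 - t % 2) for t in range(100)]

first_prediction = mixture_predict(weights, thetas, 1, 1)
for z, x in data:
    weights = posterior_update(weights, thetas, x, z)
last_prediction = mixture_predict(weights, thetas, 1, 1)
print(first_prediction, last_prediction)
```

The predictive probability of $x=1$ given $z=1$ moves from 0.5 towards 0.9 as data accumulate, so the mixture's predictions depend on the history although each model in the class treats the experiments as independent.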
6.3 Prediction with Expert Advice

There are two schools of universal sequence prediction: We considered expected
performance bounds for Bayesian prediction based on mixtures of environments,
as is common in information theory and statistics [MF98]. The other approach
are predictors based on expert advice (PEA) with worst case loss bounds in the
spirit of Littlestone, Warmuth, Vovk and others. We briefly describe PEA and
compare both approaches. For a more comprehensive comparison see [MF98]. In
the following we focus on topics not covered in [MF98]. PEA was invented in
[LW89, LW94] and [Vov92] and further developed in [CB97, HKW98, KW99] and
by many others. Many variations known by many names (prediction/learning with
expert advice, weighted majority/average, aggregating strategy, hedge algorithm, ...)
have meanwhile been invented. Early works in this direction are [Daw84, Ris89]. See
[Vov99] for a review and further references. We describe the setting and basic idea
of PEA for binary alphabet. Consider a finite binary sequence $x_1x_2...x_n\in\{0,1\}^n$
and a finite set $\mathcal{E}$ of experts $e\in\mathcal{E}$ making predictions $x^e_t$ in the unit interval $[0,1]$
based on past observations $x_1x_2...x_{t-1}$. The loss of expert $e$ in step $t$ is defined as
$|x_t-x^e_t|$. In the case of binary predictions $x^e_t\in\{0,1\}$, $|x_t-x^e_t|$ coincides with our
error measure ((Tj). The PEA algorithm $p_{\beta n}$ combines the predictions of all experts.
It forms its own prediction$^{10}$ $x^p_t\in[0,1]$ according to some weighted average of the
experts' predictions $x^e_t$. There are certain update rules for the weights depending
on some parameter $\beta$. Various bounds for the total loss $L_p(x) := \sum_{t=1}^n|x_t-x^p_t|$ of
PEA in terms of the total loss $L_\varepsilon(x) := \sum_{t=1}^n|x_t-x^\varepsilon_t|$ of the best expert $\varepsilon\in\mathcal{E}$ have
been proven. It is possible to fine tune $\beta$ and to eliminate the necessity of knowing
$n$ in advance. The first bound of this kind has been obtained in [CB97]:

$$L_p(x) \le L_\varepsilon(x) + 2.8\ln|\mathcal{E}| + 4\sqrt{L_\varepsilon(x)\ln|\mathcal{E}|}. \qquad (23)$$

The constants 2.8 and 4 have been improved in [AG00, YEY01]. The last bound in
Theorem 2 with $S_n\le D_n\le\ln|\mathcal{M}|$ for uniform weights and with $E^{\Theta_\mu}$ increased to
$E^{\Theta}$ reads

$$E^{\Theta_\xi} \le E^{\Theta} + 2\ln|\mathcal{M}| + 2\sqrt{E^{\Theta}\ln|\mathcal{M}|}.$$

It has a quite similar structure as (23), although the algorithms, the settings,
the proofs, and the interpretation are quite different. Whereas PEA performs well
in any environment, but only relative to a given set of experts $\mathcal{E}$, our $\Theta_\xi$ predictor
competes with the best possible $\Theta_\mu$ predictor (and hence with any other $\Theta$ predictor),
but only in expectation and for a given set of environments $\mathcal{M}$. PEA depends
on the set of experts, $\Theta_\xi$ depends on the set of environments $\mathcal{M}$. The basic $p_{\beta n}$
algorithm has been extended in different directions: incorporation of different initial
weights ($|\mathcal{E}|\leadsto w_e^{-1}$) [LW89, Vov92], more general loss functions [HKW98], continuous
valued outcomes [HKW98], and multi-dimensional predictions [KW99] (but not yet
for the absolute loss). The work of [Yam98] lies somewhat in between PEA and
this work; "PEA" techniques are used to prove expected loss bounds (but only for
sequences of independent symbols/experiments and limited classes of loss functions).
Finally, note that the predictions of PEA are continuous. This is appropriate for
weather forecasters who announce the probability of rain, but the decision to wear
sunglasses or to take an umbrella is binary, and the suffered loss depends on this
binary decision, and not on the probability estimate. It is possible to convert the
continuous prediction of PEA into a probabilistic binary prediction by predicting
1 with probability $x^p_t\in[0,1]$. $|x_t-x^p_t|$ is then the probability of making an error.
Note that the expectation is taken over the probabilistic prediction, whereas for
the deterministic $\Theta_\xi$ algorithm the expectation is taken over the environmental
distribution $\mu$. The multi-dimensional case [KW99] could then be interpreted as a
(probabilistic) prediction of symbols over an alphabet $\mathcal{X}=\{0,1\}^d$, but error bounds
for the absolute loss have yet to be proven. In [FS97] the regret is bounded by
$\ln|\mathcal{E}| + \sqrt{2\tilde{L}\ln|\mathcal{E}|}$ for arbitrary unit loss function and alphabet, where $\tilde{L}$ is an upper
bound on $L_\varepsilon$, which has to be known in advance. It would be interesting to generalize
PEA and bound (23) to arbitrary alphabet and weights and to general loss functions
with probabilistic interpretation.

$^{10}$ The original PEA version [LW89] had discrete prediction $x^p_t\in\{0,1\}$ with (necessarily) twice as
many errors as the best expert and is now only of historical interest.
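
The PEA setting described in this section can be sketched with a generic exponentially weighted average forecaster under absolute loss. This is one common variant and is not claimed to be the exact algorithm of [CB97] for which (23) was proven. The four experts, the test sequence, and the learning rate $\eta=\sqrt{8\ln|\mathcal{E}|/n}$ (which requires knowing $n$ in advance) are illustrative assumptions. For this choice the standard guarantee for convex losses in $[0,1]$ is a regret of at most $\sqrt{(n/2)\ln|\mathcal{E}|}$.

```python
import math

def exponential_weights(xs, experts, eta):
    """Exponentially weighted average forecaster under absolute loss.

    xs      : binary outcome sequence
    experts : functions mapping the past x_{<t} to a prediction in [0, 1]
    eta     : learning rate (assumed tuning parameter)
    Returns (total loss of the forecaster, list of total expert losses).
    """
    weights = [1.0] * len(experts)
    loss_p = 0.0
    loss_e = [0.0] * len(experts)
    for t, x in enumerate(xs):
        preds = [e(xs[:t]) for e in experts]
        # the forecaster's prediction is the weighted average of expert predictions
        p = sum(w * q for w, q in zip(weights, preds)) / sum(weights)
        loss_p += abs(x - p)
        for i, q in enumerate(preds):
            loss_e[i] += abs(x - q)
            weights[i] *= math.exp(-eta * abs(x - q))  # penalize each expert's loss
    return loss_p, loss_e

# hypothetical experts: always 0, always 1, repeat last symbol, flip last symbol
experts = [
    lambda past: 0.0,
    lambda past: 1.0,
    lambda past: float(past[-1]) if past else 0.5,
    lambda past: 1.0 - past[-1] if past else 0.5,
]
xs = [1, 1, 0, 1] * 50
n, m = len(xs), len(experts)
eta = math.sqrt(8 * math.log(m) / n)
loss_p, loss_e = exponential_weights(xs, experts, eta)
regret = loss_p - min(loss_e)
print(loss_p, min(loss_e), regret)
```

The guarantee holds for every sequence, with no probabilistic assumption on how $x$ is generated, which is the worst-case flavour contrasted in the text with the expected bounds for $\Theta_\xi$.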



6.4 Outlook 

In the following we discuss several directions in which the findings of this work may 
be extended. 

Infinite alphabet. In many cases the basic prediction unit is not a letter, but a
number (for inducing number sequences), or a word (for completing sentences), or
a real number or vector (for physical measurements). The prediction may either be
generalized to a block by block prediction of symbols or, more suitably, the finite
alphabet $\mathcal{X}$ could be generalized to countable (numbers, words) or continuous (real
or vector) alphabets. The presented theorems are independent of the size of $\mathcal{X}$ and
hence should generalize to countably infinite alphabets by appropriately taking the
limit $|\mathcal{X}|\to\infty$ and to continuous alphabets by a denseness or separability argument.
Since the proofs are also independent of the size of $\mathcal{X}$ we may directly replace all
finite sums over $\mathcal{X}$ by infinite sums or integrals and carefully check the validity of
each operation. We expect all theorems to remain valid in full generality, except for
minor technical existence and convergence constraints.

An infinite prediction space $\mathcal{Y}$ was no problem at all as long as we assumed the
existence of $y_t^{\Lambda_\rho}\in\mathcal{Y}$ (fT3j). In case $y_t^{\Lambda_\rho}\in\mathcal{Y}$ does not exist one may define $y_t^{\Lambda_\rho}\in\mathcal{Y}$ in a
way to achieve a loss at most $\varepsilon_t = o(t^{-1})$ larger than the infimum loss. We expect a
small finite correction of the order of $\varepsilon = \sum_{t=1}^{\infty}\varepsilon_t < \infty$ in the loss bounds somehow.

Delayed & probabilistic prediction. The $\Lambda_\rho$ schemes and theorems may be
generalized to delayed sequence prediction, where the true symbol $x_t$ is given only
in cycle $t+d$. A delayed feedback is common in many practical problems. We expect
bounds with $D_n$ replaced by $d\cdot D_n$. Further, the error bounds for the probabilistic
suboptimal $\xi$ scheme defined and analyzed in [Hut01b] can also be generalized to
arbitrary alphabet.

More active systems. Prediction means guessing the future, but not influencing
it. A small step in the direction of more active systems was to allow the $\Lambda$ system to
act and to receive a loss $\ell_{x_ty_t}$ depending on the action $y_t$ and the outcome $x_t$. The
probability $\mu$ is still independent of the action, and the loss function has to be
known in advance. This ensures that the greedy strategy (15) is optimal. The loss
function may be generalized to depend not only on the history $x_{<t}$, but also on the
historic actions $y_{<t}$ with $\mu$ still independent of the action. It would be interesting
to know whether the scheme $\Lambda$ and/or the loss bounds generalize to this case. The
full model of an acting agent influencing the environment has been developed in
[Hut01c]. Pareto-optimality and asymptotic bounds are proven in [Hut02], but a lot
remains to be done in the active case.

Miscellaneous. Another direction is to investigate the learning aspect of universal 
prediction. Many prediction schemes explicitly learn and exploit a model of the 
environment. Learning and exploitation are melted together in the framework of 
universal Bayesian prediction. A separation of these two aspects in the spirit of
hypothesis learning with MDL [VL00] could lead to new insights. Also, the separation
of noise from useful data, usually an important issue [GTV01], did not play a role
here. The attempt at an information theoretic interpretation of Theorem |21 may be 
made more rigorous in this or another way. In the end, this may lead to a simpler 
proof of Theorem 01 and maybe even for the loss bounds. A unified picture of the 
loss bounds obtained here and the loss bounds for predictors based on expert advice
(PEA) could also be fruitful. Yamanishi [Yam98] used PEA methods to prove
expected loss bounds for Bayesian prediction, so maybe the proof technique presented
here could be used vice versa to prove more general loss bounds for PEA.
Maximum-likelihood or MDL predictors may also be studied. For instance, $2^{-K(x)}$ (or some of
its variants) is a close approximation of $\xi_U$, so one may think that predictions based
on (variants of) $K$ may be as good as predictions based on $\xi_U$, but it is easy to see
that $K$ completely fails for predictive purposes. Also, more promising variants like
the monotone complexity $Km$ and universal two-part MDL, both extremely close
to $\xi_U$, fail in certain situations [Hut03c]. Finally, the system should be applied to
specific induction problems for specific $\mathcal{M}$ with computable $\xi$.



7 Summary 

We compared universal predictions based on Bayes-mixtures $\xi$ to the infeasible
informed predictor based on the unknown true generating distribution $\mu$. Our main
focus was on a decision-theoretic setting, where each prediction $y_t\in\mathcal{X}$ (or more
generally action $y_t\in\mathcal{Y}$) results in a loss $\ell_{x_ty_t}$ if $x_t$ is the true next symbol of the
sequence. We have shown that the $\Lambda_\xi$ predictor suffers only slightly more loss than
the $\Lambda_\mu$ predictor. We have shown that the derived error and loss bounds cannot
be improved in general, i.e. without making extra assumptions on $\ell$, $\mu$, $\mathcal{M}$, or $w_\nu$.
Within a factor of 2 this is also true for any $\mu$ independent predictor. We have also
shown Pareto-optimality of $\xi$ in the sense that there is no other predictor which
performs at least as well in all environments $\nu\in\mathcal{M}$ and strictly better in at least
one. Optimal predictors can (in most cases) be based on mixture distributions $\xi$.
Finally we gave an Occam's razor argument that the universal prior with weights
$w_\nu = 2^{-K(\nu)}$ is optimal, where $K(\nu)$ is the Kolmogorov complexity of $\nu$. Of course,
optimality always depends on the setup, the assumptions, and the chosen criteria.
For instance, the universal predictor was not Pareto-optimal for every performance
measure, but it was for many popular ones and for all decision-theoretic ones. Bayes predictors
are also not necessarily optimal under worst case criteria [CBL01]. We also derived a
bound for the relative entropy between $\xi$ and $\mu$ in the case of a continuously
parameterized family of environments, which allowed us to generalize the loss bounds to
continuous $\mathcal{M}$. Furthermore, we discussed the duality between the Bayes-mixture
and expert-mixture (PEA) approaches and results, classification tasks, games of
chance, infinite alphabet, active systems influencing the environment, and others.